Introduction to Supervised Learning in Machine Learning
In this post, we'll dig into the concept of supervised learning, what machines need in order to learn, and how they learn and improve their prediction accuracy.
What is Supervised Learning
When it comes to machine learning, there are primarily four types:
- Supervised Machine Learning
- Unsupervised Machine Learning
- Semi-Supervised Machine Learning
- Reinforcement Learning
Supervised machine learning refers to the process of training a machine using labeled data. Labeled data can consist of numeric or string values. For example, imagine that you have photos of animals, such as cats and dogs. To train your machine to recognize the animal, you’ll need to “label” or indicate the name of the animal alongside each animal image. The machine will then learn to pick up similar patterns in photos and predict the appropriate label.
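As a tiny, made-up illustration, labeled data is simply inputs paired with the answers we want the machine to learn (the file paths and labels below are invented for the example):

```python
# Each training example pairs an input (here, an image path) with its label.
labeled_data = [
    ("photos/cat_001.jpg", "cat"),
    ("photos/dog_001.jpg", "dog"),
    ("photos/cat_002.jpg", "cat"),
]
```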
Machine Learning
Machine learning refers to the process a machine goes through so that it can produce predictions. As mentioned, a machine can identify a cat in a photo even if it has never seen that particular cat before. But how?
Through training, of course, which is an iterative process that improves the accuracy of the output (or prediction). In supervised machine learning, we teach the machine to identify things based on the labeled data we give it.
Today, you see and interact with trained machines everywhere. Netflix, YouTube, TikTok, and most other services implement some kind of algorithm that uses the data collected from you to learn about you, so it can serve you things you like. That's why you spend endless hours scrolling.
Supervised vs Unsupervised
Think of it like this:
As humans, we recognize a cat as a cat because we have been taught what a cat looks like by our parents and teachers. They basically "supervised" us and "labeled" our data. However, when we classify good and bad friends, we rely on our personal experiences and observations to achieve this. Similarly, machines can learn through supervised learning, where they are taught to recognize specific images, or through unsupervised learning, where they make their own judgments based on the data provided to them.
A machine's learning process is different from ours, but it was inspired by how our brains work. To train computers, we mostly use statistical algorithms like Linear Regression, Decision Trees (DTs), and K-Nearest Neighbours (KNN).
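As a rough sketch of what using one of these algorithms looks like, here is a minimal K-Nearest Neighbours example with scikit-learn; the features and values are invented purely for illustration:

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up feature vectors: [weight_kg, ear_length_cm] for each animal.
X = [[4.0, 6.5], [5.2, 7.0], [20.0, 10.0], [25.0, 11.5]]
y = ["cat", "cat", "dog", "dog"]

# Fit the model on the labeled examples ...
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

# ... then predict the label for a new, unseen animal.
print(model.predict([[4.5, 6.8]]))  # -> ['cat']
```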
5 things you'll need to train your model
Understand the problem
First, you'll need to understand the problem you're trying to solve. Machine learning can be used to answer a broad range of questions, such as:
- Can we accurately predict diseases in patients?
- Can we predict the price of houses?
It's important to understand the question we're trying to answer. Let's take the first question from the list above:
"Can we accurately predict diseases in patients?"
We can rephrase this question to:
"Is it possible to utilize historical patient data such as age, gender, blood pressure, cholesterol, and medical conditions to predict the likelihood of a new patient developing a disease?"
The answer is: yes. This is known as a classification problem, where the input data is used to predict which of a list of predetermined categories (here, diseases) a new patient is likely to fall into.
Get and prepare the data
Imagine you buy a textbook for a math class, and all the pages are blank. Or worse, imagine the pages contain random information unrelated to the topic, or even unrecognizable characters. Would you be able to learn anything? Of course not. You need organized information. Similarly, we need to prepare our data before we can use it.
So, the next step is to get the data. It could be located in many places, like:
- Hospital internal database (SQL)
- Publicly available information (Web Scraping)
- Public health records (JSON)
As you can see, the data could be in multiple locations and in many shapes and formats. As long as the data is relevant to our problem, we can make use of it.
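As a hedged sketch of what pulling these sources together might look like with pandas (the database file, table name, JSON file, and join key below are placeholders, not real resources):

```python
import sqlite3

import pandas as pd

# Hypothetical hospital database (SQL).
conn = sqlite3.connect("hospital.db")
hospital_df = pd.read_sql("SELECT * FROM patients", conn)

# Hypothetical public health records exported as JSON.
records_df = pd.read_json("public_health_records.json")

# Web-scraped data (e.g., relevance of public articles) would be cleaned
# and merged the same way. Here we join the sources on a shared patient ID.
data = hospital_df.merge(records_df, on="patient_id", how="left")
print(data.head())
```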
Let's assume we were able to pull all relevant data, and performed data wrangling to obtain something like this:
ID | Age | Gender | Blood P. | Cholesterol | Symptoms | Wikipedia R. | Health Records R. | Disease |
---|---|---|---|---|---|---|---|---|
1 | 45 | Male | High | High | Chest pain | High | Low | Coronary artery disease |
2 | 32 | Female | Normal | Normal | Fatigue, Headache | Low | High | Migraine |
3 | 68 | Male | High | High | Shortness of breath | High | High | Chronic obstructive pulmonary disease |
4 | 55 | Female | Normal | High | Abdominal pain, Nausea | Low | High | Gastritis |
5 | 50 | Male | High | Normal | Frequent urination, Excessive thirst | Low | Low | Diabetes |
This is just an example, of course. The data could be in many different formats and include many other columns and outputs. The R. in Wikipedia R. and Health Records R. stands for Relevance.
Explore and analyze the data
Now that we're working with clean data, it's important to take a closer look and perform what is referred to as Exploratory Data Analysis (EDA) to find patterns and summarize the dataset's main characteristics. For example, to understand the distribution of our dataset we can calculate the mean, median, and range of the age variable. We can also analyze the relationship between disease and gender by calculating the percentage of patients with a given disease within each gender.
We will not go into technical details in this post, but some common analyses done during the EDA phase include:
- Data Distribution
- Dataset Structure
- Missing Values and Outliers
- Correlations
- Assumption Checks
- Visualization and Plotting
- Pattern Identification
- Relevance of External Data
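To give a quick flavor of a couple of these checks without going deep, here's a small pandas sketch; it assumes the wrangled data from the table above has been saved to a (hypothetical) CSV file:

```python
import pandas as pd

# Hypothetical file produced by the data wrangling step above.
df = pd.read_csv("patients.csv")

# Distribution of the Age variable: mean, median, and range.
print(df["Age"].mean(), df["Age"].median(), df["Age"].max() - df["Age"].min())

# Missing values per column.
print(df.isna().sum())

# Percentage of each disease within each gender.
print(pd.crosstab(df["Gender"], df["Disease"], normalize="index") * 100)
```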
Choose a suitable algorithm
As we've seen earlier, we have a classification problem. We can therefore build model candidates using common classification algorithms and then compare outputs to choose the most accurate.
For this example, I'm going to use two algorithms popular for solving classification problems:
- Random Forest
- Support Vector Machine (SVM)
Train, test, and refine
Using Python and scikit-learn (a machine learning library for Python), we can measure the accuracy of both algorithms on our dataset. We'll train each model on a portion of the data and test it on the rest.
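Here's a minimal sketch of what that comparison could look like; the file name and preprocessing choices are assumptions based on the example table above, not a full pipeline:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical CSV holding the wrangled patient data from the table above.
df = pd.read_csv("patients.csv")

# One-hot encode the categorical features; keep the disease as the target.
X = pd.get_dummies(df.drop(columns=["ID", "Disease"]))
y = df["Disease"]

# Hold out part of the data for testing and train on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train both candidate models and compare their accuracy on the test set.
for name, model in [("Random Forest", RandomForestClassifier(random_state=42)),
                    ("SVM", SVC())]:
    model.fit(X_train, y_train)
    print(name, "accuracy:", accuracy_score(y_test, model.predict(X_test)))
```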
Comparing the accuracy results for the two models, we'd choose Random Forest, since it reached 85% accuracy versus just 28% for SVM.
Obviously, you can try other algorithms until you're satisfied with the outputs based on your criteria and the problem you're trying to solve.
The above is essentially what goes into the supervised machine learning process. It's important to highlight that this is an iterative process that does not end after training. We need to deploy the model and gather feedback from stakeholders, which could lead to further refinement based on new data and other factors.
Conclusion
Thanks for reading! In this post, we covered what supervised machine learning is, what machines need in order to learn, how they learn, and how they improve. We also covered important steps, such as Data Wrangling and EDA, that are crucial to the prediction accuracy and relevance of your model.
If you want to learn more, I have covered the most commonly used algorithms in supervised machine learning problems. These include Simple Linear Regression, Multiple Linear Regression, and their implementation using Python.