Introduction to Supervised Learning in Machine Learning

In this post, we'll explore the concept of supervised learning, what machines need in order to learn, and how the learning process improves prediction accuracy.

What is Supervised Learning

When it comes to machine learning, there are primarily four types:

  • Supervised Machine Learning
  • Unsupervised Machine Learning
  • Semi-Supervised Machine Learning
  • Reinforcement Learning

Supervised machine learning refers to the process of training a machine using labeled data. Labeled data can consist of numeric or string values. For example, imagine that you have photos of animals, such as cats and dogs. To train your machine to recognize the animal, you’ll need to “label” or indicate the name of the animal alongside each animal image. The machine will then learn to pick up similar patterns in photos and predict the appropriate label.
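To make the idea of labeled data concrete, here's a minimal sketch in Python. The file names and labels are made up purely for illustration; each input example is simply paired with the answer we want the machine to learn to predict.

# Each training example pairs an input (here, a photo's file path)
# with its label -- the answer we want the machine to learn to predict.
labeled_data = [
    ("photos/img_001.jpg", "cat"),
    ("photos/img_002.jpg", "dog"),
    ("photos/img_003.jpg", "cat"),
]

for photo, label in labeled_data:
    print(f"{photo} -> {label}")

Labeled data example using Python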

Machine Learning

Machine Learning refers to the process a machine goes through so that it can produce predictions. As mentioned, a machine can identify a cat in a photo even if it has never seen that particular cat before. But how?

Through training, of course, which is an iterative process that improves output (or prediction) accuracy. In supervised machine learning, we teach the machine to identify things based on the labeled data we give it.

Today, you see and interact with trained machines everywhere. Netflix, YouTube, TikTok, and most similar services implement some kind of algorithm that uses the data collected from you to learn your preferences so it can serve you things you like. That's why you spend endless hours scrolling.

⚠️
The more data you give these services, the more they learn about you. Some of them arguably know you better than you know yourself.

Supervised vs Unsupervised

Think of it like this:

As humans, we recognize a cat as a cat because we have been taught what a cat looks like by our parents and teachers. They basically "supervised" us and "labeled" our data. However, when we classify good and bad friends, we rely on our personal experiences and observations to achieve this. Similarly, machines can learn through supervised learning, where they are taught to recognize specific images, or through unsupervised learning, where they make their own judgments based on the data provided to them.

A machine's learning process is different from ours, but it was inspired by how our brains work. To train computers, we mostly use statistical algorithms like Linear Regression, Decision Trees (DTs), and K-Nearest Neighbors (KNN).

💡
An algorithm is a sequence of operations that is typically used by computers to find the correct solution to a problem (or identify that there are no correct solutions).

5 things you'll need to train your model

Understand the problem

First, you'll need to understand the problem that you're trying to solve. Machine learning can be used to answer a broad range of questions, things like:

  • Can we accurately predict diseases in patients?
  • Can we predict the price of houses?

It's important to understand the question we're trying to answer. Let's take the first question from the list above:

"Can we accurately predict diseases in patients?"

We can rephrase this question to:

"Is it possible to utilize historical patient data such as age, gender, blood pressure, cholesterol, and medical conditions to predict the likelihood of a new patient developing a disease?"

The answer is: Yes. This is known as a classification problem, where the input data is used to predict which of a set of predetermined categories (in this case, diseases) a new patient is likely to fall into.
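Put differently, each historical patient becomes a set of input features and a known outcome (the label). A rough sketch of that framing in Python, with illustrative values only:

# The inputs (features) describe the patient...
patient_features = {
    "age": 45,
    "gender": "Male",
    "blood_pressure": "High",
    "cholesterol": "High",
    "symptoms": "Chest pain",
}

# ...and the label is the category we want the model to learn to predict.
patient_label = "Coronary artery disease"

Framing a patient as features and a label, example using Python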

Get and prepare the data

Imagine you buy a textbook for a math class, and all the pages are blank. Or worse, imagine the pages contain random information unrelated to the topic, or even unrecognizable characters. Would you be able to learn anything? Of course not. You'd need organized information. Similarly, we'll need to prepare our data before we can use it.

💡
The quality of your data will determine the quality of your predictions.

So, the next step is to get the data. It could be located in many places, like:

  • Hospital internal database (SQL)
  • Publicly available information (Web Scraping)
  • Public health records (JSON)

As you can see, the data could be in multiple locations and in many shapes and formats. As long as the data is relevant to our problem, we can make use of it.
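As a small illustration of pulling data from different kinds of sources with Python, here's a minimal sketch using pandas; the database file, table name, and JSON file name are placeholders, not real sources.

import sqlite3

import pandas as pd

# Internal database (SQL) -- a local SQLite file stands in for the hospital database.
conn = sqlite3.connect("hospital.db")
patients = pd.read_sql("SELECT * FROM patients", conn)

# Public health records (JSON).
records = pd.read_json("public_health_records.json")

# Publicly available information (web scraping) usually involves requests plus an
# HTML parser; the details depend heavily on the site, so they're omitted here.

Loading data from different sources, example using Python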

Let's assume we were able to pull all relevant data, and performed data wrangling to obtain something like this:

ID | Age | Gender | Blood Pressure | Cholesterol | Symptoms | Wikipedia R. | Health Records R. | Disease
1 | 45 | Male | High | High | Chest pain | High | Low | Coronary artery disease
2 | 32 | Female | Normal | Normal | Fatigue, Headache | Low | High | Migraine
3 | 68 | Male | High | High | Shortness of breath | High | High | Chronic obstructive pulmonary disease
4 | 55 | Female | Normal | High | Abdominal pain, Nausea | Low | High | Gastritis
5 | 50 | Male | High | Normal | Frequent urination, Excessive thirst | Low | Low | Diabetes

This is just an example, of course. The data could come in many different formats and include many other columns and outputs. The R. in Wikipedia R. and Health Records R. stands for Relevance.

💡
Data Wrangling is the process of working with raw data and converting it into a usable form.
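Continuing the loading sketch from above, here's a brief, hedged example of what a few wrangling steps might look like with pandas. The cleaning rules (merging on ID, capitalizing gender values, filling missing ages with the median) are assumptions made for illustration.

# Combine the sources into one table (assumes both share an "ID" column).
df = patients.merge(records, on="ID", how="left")

# Standardize categorical values and drop rows missing the label we want to predict.
df["Gender"] = df["Gender"].str.strip().str.capitalize()
df = df.dropna(subset=["Disease"])

# Fill missing ages with the median age rather than discarding those rows.
df["Age"] = df["Age"].fillna(df["Age"].median())

Data wrangling example using Python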

Explore and analyze the data

Now that we're working with clean data, it's important to take a closer look and perform what is referred to as Exploratory Data Analysis (EDA) to find patterns and summarize the main characteristics. For example, to understand the distribution of our dataset we can calculate the mean, median, and range of our age variable. We can also look at the relationship between disease and gender by calculating how often each disease occurs for each gender.

💡
The most common programming languages used for EDA and data analysis are Python and R. Popular Python libraries include matplotlib, seaborn, numpy, and others.

We will not go into technical details in this post, but some common analyses done during the EDA phase include:

  • Data Distribution
  • Dataset Structure
  • Handle Missing Values and Outliers
  • Determine Correlations
  • Evaluate Assumptions
  • Visualize by Plotting
  • Identify Patterns
  • Understand the Relevancy of External Data
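To make a few of these concrete, here's a brief sketch using pandas. The column names follow the example table above; the exact checks you run will depend on your data.

# Distribution of the age variable (mean, quartiles, min, max).
print(df["Age"].describe())

# How often each disease appears for each gender.
print(df.groupby("Gender")["Disease"].value_counts(normalize=True))

# Quick check for missing values in every column.
print(df.isna().sum())

A few EDA checks, example using Python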

Choose a suitable algorithm

As we've seen earlier, we have a classification problem. We can therefore build model candidates using common classification algorithms and then compare outputs to choose the most accurate.

For this example, I'm going to use two algorithms popular for solving classification problems:

  • Random Forest
  • Support Vector Machine (SVM)

Train, test, and refine

Using Python and scikit-learn (a machine learning library for Python), we can determine the accuracy of both algorithms given our dataset. We'll train each model by giving it a portion of the data.

⚠️
While we could use all of the data in our dataset to train the model, we'll be splitting the data into two parts. Commonly, it is an 80/20 split, meaning 80% of our data will go to training, and the remaining 20% will be used for testing. This lets us evaluate the model on data it has never seen, which helps us detect overfitting. The topic of overfitting was discussed in this article.
# ... Previous code omitted for brevity

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

Splitting data into training and testing sets example using Python
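For context, here's a minimal sketch of what the surrounding code might look like end to end, from encoding the features to computing both accuracy scores. It repeats the split line for completeness, and it is an assumption based on the example table: the column names, preprocessing, and default model settings are illustrative, not the exact omitted code.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Separate the inputs from the label and one-hot encode the categorical columns.
X = df.drop(columns=["ID", "Disease"])
y = df["Disease"]
X_encoded = pd.get_dummies(X)

# Same split as above: 80% for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

# Train both candidate models on the training split.
svm_model = SVC().fit(X_train, y_train)
rf_model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Measure how often each model predicts the correct disease on the held-out test set.
svm_accuracy = accuracy_score(y_test, svm_model.predict(X_test))
rf_accuracy = accuracy_score(y_test, rf_model.predict(X_test))

Training and evaluating both candidate models, example sketch using Python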

The output is as shown below:

# ... Previous code omitted for brevity

print("SVM Accuracy:", svm_accuracy)
print("Random Forest Accuracy:", rf_accuracy)

SVM Accuracy: 0.2857142857142857
Random Forest Accuracy: 0.8571428571428571

SVM vs. Random Forest accuracy example using Python

Looking at the SVM vs. Random Forest accuracy results, we'll choose Random Forest since it has an accuracy of roughly 86% vs. just 29% for SVM.

💡
Accuracy refers to the ability of the model to correctly classify the disease given a set of testing data: the number of correct predictions divided by the total number of predictions.

Obviously, you can try other algorithms until you're satisfied with the outputs based on your criteria and the problem you're trying to solve.

The above is essentially what goes into the supervised machine learning process. It's important to highlight that this is an iterative process and does not end after training. We need to deploy the model and gather feedback from stakeholders, which could lead to model refinement based on new data and other factors.

Conclusion

Thanks for reading! In this post, we covered what supervised machine learning is, what machines need in order to learn, how they learn, and how they improve. We also covered important steps such as Data Wrangling and EDA, which are crucial to the prediction accuracy and relevance of your model.

If you want to learn more, I have covered the most commonly used algorithms in supervised machine learning problems. These include Simple Linear Regression, Multiple Linear Regression, and their implementation using Python.