Introduction to Supervised Learning in Machine Learning

In this post, we'll explore the concept of supervised learning, what machines need in order to learn, and how the learning process improves prediction accuracy.

What is Supervised Learning

When it comes to machine learning, there are primarily four types:

  • Supervised Machine Learning
  • Unsupervised Machine Learning
  • Semi-Supervised Machine Learning
  • Reinforcement Learning

Supervised machine learning refers to the process of training a machine using labeled data. Labeled data can consist of numeric or string values. For example, imagine that you have photos of animals, such as cats and dogs. To train your machine to recognize the animal, you’ll need to “label” or indicate the name of the animal alongside each animal image. The machine will then learn to pick up similar patterns in photos and predict the appropriate label.
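To make the idea of labeled data concrete, here's a minimal sketch in Python. The file names and labels are made up purely for illustration; each input example is simply paired with the answer we want the machine to learn to predict.

# Each training example pairs an input (here, a photo's file path)
# with its label -- the answer we want the machine to learn to predict.
labeled_data = [
    ("photos/img_001.jpg", "cat"),
    ("photos/img_002.jpg", "dog"),
    ("photos/img_003.jpg", "cat"),
]

for photo, label in labeled_data:
    print(f"{photo} -> {label}")

Labeled data example using Python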

Machine Learning

Machine Learning refers to the process a machine goes through so that it can produce predictions. As mentioned, a machine can identify a cat in a photo even if it has never seen that particular cat before. But how?

Through training, of course, which is an iterative process that improves output (or prediction) accuracy. In supervised machine learning, we teach the machine to identify things based on the labeled data we give it.

Today, you see and interact with trained machines everywhere. Netflix, YouTube, TikTok, and most similar services implement some kind of algorithm that uses the data collected from you to learn your preferences so it can serve you things you like. That's why you spend endless hours scrolling.

⚠️
The more data you give these services, the more they learn about you. Some of them arguably know you better than you know yourself.

Supervised vs Unsupervised

Think of it like this:

As humans, we recognize a cat as a cat because we have been taught what a cat looks like by our parents and teachers. They basically "supervised" us and "labeled" our data. However, when we classify good and bad friends, we rely on our personal experiences and observations to achieve this. Similarly, machines can learn through supervised learning, where they are taught to recognize specific images, or through unsupervised learning, where they make their own judgments based on the data provided to them.

A machine's learning process is different from ours, but it was inspired by how our brains work. To train computers, we mostly use statistical algorithms like Linear Regression, Decision Trees (DTs), and K-Nearest Neighbors (KNN).

💡
An algorithm is a sequence of operations that is typically used by computers to find the correct solution to a problem (or identify that there are no correct solutions).

5 things you'll need to train your model

Understand the problem

First, you'll need to understand the problem that you're trying to solve. Machine learning can be used to answer a broad range of questions, things like:

  • Can we accurately predict diseases in patients?
  • Can we predict the price of houses?

It's important to understand the question we're trying to answer. Let's take the first question from the list above:

"Can we accurately predict diseases in patients?"

We can rephrase this question to:

"Is it possible to utilize historical patient data such as age, gender, blood pressure, cholesterol, and medical conditions to predict the likelihood of a new patient developing a disease?"

The answer is: Yes. This is known as a classification problem, where the input data is used to predict which of a set of predetermined categories (in this case, diseases) a new patient is likely to fall into.
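Put differently, each historical patient becomes a set of input features and a known outcome (the label). A rough sketch of that framing in Python, with illustrative values only:

# The inputs (features) describe the patient...
patient_features = {
    "age": 45,
    "gender": "Male",
    "blood_pressure": "High",
    "cholesterol": "High",
    "symptoms": "Chest pain",
}

# ...and the label is the category we want the model to learn to predict.
patient_label = "Coronary artery disease"

Framing a patient as features and a label, example using Python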

Get and prepare the data

Imagine you buy a textbook for a math class, and all the pages are blank. Or worse, imagine the pages contain random information unrelated to the topic, or even unrecognizable characters. Would you be able to learn anything? Of course not. You'd need organized information. Similarly, we'll need to prepare our data before we can use it.

💡
The quality of your data will determine the quality of your predictions.

So, the next step is to get the data. It could be located in many places, like:

  • Hospital internal database (SQL)
  • Publicly available information (Web Scraping)
  • Public health records (JSON)

As you can see, the data could be in multiple locations and in many shapes and formats. As long as the data is relevant to our problem, we can make use of it.
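As a small illustration of pulling data from different kinds of sources with Python, here's a minimal sketch using pandas; the database file, table name, and JSON file name are placeholders, not real sources.

import sqlite3

import pandas as pd

# Internal database (SQL) -- a local SQLite file stands in for the hospital database.
conn = sqlite3.connect("hospital.db")
patients = pd.read_sql("SELECT * FROM patients", conn)

# Public health records (JSON).
records = pd.read_json("public_health_records.json")

# Publicly available information (web scraping) usually involves requests plus an
# HTML parser; the details depend heavily on the site, so they're omitted here.

Loading data from different sources, example using Python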

Let's assume we were able to pull all relevant data, and performed data wrangling to obtain something like this:

ID | Age | Gender | Blood Pressure | Cholesterol | Symptoms | Wikipedia R. | Health Records R. | Disease
1 | 45 | Male | High | High | Chest pain | High | Low | Coronary artery disease
2 | 32 | Female | Normal | Normal | Fatigue, Headache | Low | High | Migraine
3 | 68 | Male | High | High | Shortness of breath | High | High | Chronic obstructive pulmonary disease
4 | 55 | Female | Normal | High | Abdominal pain, Nausea | Low | High | Gastritis
5 | 50 | Male | High | Normal | Frequent urination, Excessive thirst | Low | Low | Diabetes

This is just an example, of course. The data could come in many different formats and include many other columns and outputs. The R. in Wikipedia R. and Health Records R. stands for Relevance.

💡
Data Wrangling is the process of working with raw data and converting it into a usable form.
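Continuing the loading sketch from above, here's a brief, hedged example of what a few wrangling steps might look like with pandas. The cleaning rules (merging on ID, capitalizing gender values, filling missing ages with the median) are assumptions made for illustration.

# Combine the sources into one table (assumes both share an "ID" column).
df = patients.merge(records, on="ID", how="left")

# Standardize categorical values and drop rows missing the label we want to predict.
df["Gender"] = df["Gender"].str.strip().str.capitalize()
df = df.dropna(subset=["Disease"])

# Fill missing ages with the median age rather than discarding those rows.
df["Age"] = df["Age"].fillna(df["Age"].median())

Data wrangling example using Python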

Explore and analyze the data

Now that we're working with clean data, it's important to take a closer look and perform what is referred to as Exploratory Data Analysis (EDA) to find patterns and summarize the main characteristics. For example, to understand the distribution of our dataset we can calculate the mean, median, and range of our age variable. We can also look at the relationship between disease and gender by calculating how often each disease occurs for each gender.

💡
The most common programming languages used for EDA and data analysis are Python and R. Popular Python libraries include matplotlib, seaborn, numpy, and others.

We will not go into technical details in this post, but some common analyses done during the EDA phase include:

  • Data Distribution
  • Dataset Structure
  • Handle Missing Values and Outliers
  • Determine Correlations
  • Evaluate Assumptions
  • Visualize by Plotting
  • Identify Patterns
  • Understand the Relevancy of External Data
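To make a few of these concrete, here's a brief sketch using pandas. The column names follow the example table above; the exact checks you run will depend on your data.

# Distribution of the age variable (mean, quartiles, min, max).
print(df["Age"].describe())

# How often each disease appears for each gender.
print(df.groupby("Gender")["Disease"].value_counts(normalize=True))

# Quick check for missing values in every column.
print(df.isna().sum())

A few EDA checks, example using Python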

Choose a suitable algorithm

As we've seen earlier, we have a classification problem. We can therefore build model candidates using common classification algorithms and then compare outputs to choose the most accurate.

For this example, I'm going to use two algorithms popular for solving classification problems:

  • Random Forest
  • Support Vector Machine (SVM)

Train, test, and refine

Using Python and scikit-learn (a machine learning library for Python), we can determine the accuracy of both algorithms given our dataset. We'll train each model by giving it a portion of the data.

⚠️
While we could use all of the data in our dataset to train the model, we'll be splitting the data into two parts. Commonly, it is an 80/20 split, meaning 80% of our data will go to training, and the remaining 20% will be used for testing. This lets us evaluate the model on data it has never seen, which helps us detect overfitting. The topic of overfitting was discussed in this article.
# ... Previous code omitted for brevity

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

Splitting data into training and testing sets example using Python
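For context, here's a minimal sketch of what the surrounding code might look like end to end, from encoding the features to computing both accuracy scores. It repeats the split line for completeness, and it is an assumption based on the example table: the column names, preprocessing, and default model settings are illustrative, not the exact omitted code.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Separate the inputs from the label and one-hot encode the categorical columns.
X = df.drop(columns=["ID", "Disease"])
y = df["Disease"]
X_encoded = pd.get_dummies(X)

# Same split as above: 80% for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

# Train both candidate models on the training split.
svm_model = SVC().fit(X_train, y_train)
rf_model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Measure how often each model predicts the correct disease on the held-out test set.
svm_accuracy = accuracy_score(y_test, svm_model.predict(X_test))
rf_accuracy = accuracy_score(y_test, rf_model.predict(X_test))

Training and evaluating both candidate models, example sketch using Python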

The output is as shown below:

# ... Previous code omitted for brevity

print("SVM Accuracy:", svm_accuracy)
print("Random Forest Accuracy:", rf_accuracy)

SVM Accuracy: 0.2857142857142857
Random Forest Accuracy: 0.8571428571428571

SVM vs. Random Forest accuracy example using Python

Looking at the SVM vs. Random Forest accuracy results, we'll choose Random Forest since it has an accuracy of roughly 86% vs. just 29% for SVM.

💡
Accuracy refers to the ability of the model to correctly classify the disease given a set of testing data: the number of correct predictions divided by the total number of predictions.

Obviously, you can try other algorithms until you're satisfied with the outputs based on your criteria and the problem you're trying to solve.

The above is essentially what goes into the supervised machine learning process. It's important to highlight that this is an iterative process and does not end after training. We need to deploy the model and gather feedback from stakeholders, which could lead to model refinement based on new data and other factors.

Conclusion

Thanks for reading! In this post, we covered what supervised machine learning is, what machines need in order to learn, how they learn, and how they improve. We also covered important steps such as Data Wrangling and EDA, which are crucial to the prediction accuracy and relevance of your model.

If you want to learn more, I have covered the most commonly used algorithms in supervised machine learning problems. These include Simple Linear Regression, Multiple Linear Regression, and their implementation using Python.