An introduction to linear regression for machine learning

In this post, I will go over the concept of simple linear regression, delve into the underlying mathematical principles of the algorithm, and explore its practical application in the field of machine learning.

⚠️
Heads up! Since this is an introduction to linear regression, we're going to explore a simple method for making predictions called the closed-form (or analytical) solution. It is the simplest and most efficient method when working with small to mid-size datasets.

What is Linear Regression?

Linear regression is a type of regression analysis used to make predictions based on labeled data. In this post, we're going to focus on the simplest form of the algorithm, which is referred to as Simple Linear Regression, or simply Linear Regression.

Use Cases

  • House Price Prediction: Linear regression can be used to predict the price of a house based on its square footage (sqft).
    • Note: The predicted price depends on the square footage (sqft); therefore, price is the dependent variable and square footage is the independent variable.
  • Sales Forecasting: Companies can predict marketing return on investment (ROI) based on previous advertising spend.
    • Note: The predicted return on investment depends on the advertising spend; therefore, return on investment is the dependent variable and advertising spend is the independent variable.
  • Medical Research: Linear regression is often used to predict disease risk in patients given their age.
    • Can you guess the independent variable in this case?

Making Predictions

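Let's take the first example and see how we can apply linear regression to predict a house price. Here is a table showing house prices, given their area size (sqft):

| House # | Square Footage (sqft) | Price ($) |
|---------|-----------------------|-----------|
| 1       | 1000                  | 100000    |
| 2       | 1200                  | 150000    |
| 3       | 1500                  | 200000    |
| 4       | 1800                  | 250000    |
| 5       | 2200                  | 300000    |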

We're asked if we can predict the price of a 1300 sqft house given the data we already have. Could we use linear regression in this case?

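Of course. Let's add the new house to our table:

| House # | Square Footage (sqft) | Price ($) |
|---------|-----------------------|-----------|
| ...     | ...                   | ...       |
| 6       | 1300                  | ??        |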

💡
Since we have a single independent variable, in this case, "square footage (sqft)", we can use the simple (or single) linear regression formula below.

Simple Linear Regression Formula

$$\hat{y} = b_0 + b_1 x + e$$

We're going to break this down to understand how it works, then we'll compute our predicted house price ŷ.

  • ŷ (y-hat) is the predicted house price (dependent variable)
  • x is our independent variable, 1300 (sqft)
  • b0 is the intercept
    • It is the value of ŷ when x = 0
  • b1 is the slope
    • It tells us how much ŷ will change for every 1 unit change in x
  • e is the error term. We'll cover it in another post, so for this one let's assume it's zero.
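To make the formula concrete, here is a minimal sketch of it as a Python function. The function name `predict` and the example coefficients are illustrative assumptions, not values from our house data; the error term defaults to zero, as assumed above.

```python
def predict(x: float, b0: float, b1: float, e: float = 0.0) -> float:
    """Return y-hat = b0 + b1 * x + e for a single input x."""
    return b0 + b1 * x + e


# Hypothetical coefficients, chosen only to show the mechanics:
# an intercept of 10 and a slope of 2.
print(predict(x=0, b0=10, b1=2))  # intercept: the value of y-hat when x = 0
print(predict(x=5, b0=10, b1=2))
print(predict(x=6, b0=10, b1=2))  # one more unit of x adds b1 to y-hat
```

Note how moving x from 5 to 6 changes the result by exactly the slope, which is what b1 means.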

Calculating the slope

We know x is 1,300, but what about b0 and b1? We'll need them to predict the house price (ŷ).

Let's start by looking closer at b1:

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
I know, I know. It looks ugly and complicated. But, you'll see that it's quite simple as you read further.

  • xi is the value of x for each ith observation (or house #), which in our case is:

    • For the first house in the table, x is 1000 (x1)
    • For the second house in the table, x is 1200 (x2)
    • For the third house in the table, x is 1500 (x3)
    • For the fourth house in the table, x is 1800 (x4)
    • For the fifth house in the table, x is 2200 (x5)
  • x̄ (x-bar) is the average square footage of the known houses. It is calculated by dividing the sum of the square footage values by the total number of houses.
    $$\bar{x} = \frac{1000 + 1200 + 1500 + 1800 + 2200}{5} = 1540$$

  • yi is the value of y for each ith observation which in our case means:

    • For the first house in the table, y is 100000 (y1)
    • For the second house in the table, y is 150000 (y2)
    • For the third house in the table, y is 200000 (y3)
    • For the fourth house in the table, y is 250000 (y4)
    • For the fifth house in the table, y is 300000 (y5)
  • ȳ (y-bar) is the average value of the dependent variable (the house prices). It is calculated by dividing the sum of the prices by the total number of houses.
    $$\bar{y} = \frac{100000 + 150000 + 200000 + 250000 + 300000}{5} = 200000$$
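The two averages are easy to check in Python. This is a small sketch using the five houses listed above; the variable names `sqft`, `price`, `x_bar`, and `y_bar` are my own choices.

```python
# Square footage (x) and price (y) for the five known houses.
sqft = [1000, 1200, 1500, 1800, 2200]
price = [100000, 150000, 200000, 250000, 300000]

# Mean = sum of the values divided by how many values there are.
x_bar = sum(sqft) / len(sqft)
y_bar = sum(price) / len(price)

print(x_bar)  # 1540.0
print(y_bar)  # 200000.0
```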

We have all we need to replace and solve the formula. We'll start with the numerator:
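$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

$$
\begin{aligned}
i = 1:&\quad (1000 - 1540) \times (100000 - 200000) = 54000000 \\
i = 2:&\quad (1200 - 1540) \times (150000 - 200000) = 17000000 \\
i = 3:&\quad (1500 - 1540) \times (200000 - 200000) = 0 \\
i = 4:&\quad (1800 - 1540) \times (250000 - 200000) = 13000000 \\
i = 5:&\quad (2200 - 1540) \times (300000 - 200000) = 66000000
\end{aligned}
$$

We then add them up:
$$54000000 + 17000000 + 0 + 13000000 + 66000000 = 150000000$$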
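If you'd rather not multiply these by hand, the numerator can be sketched in a few lines of Python. The names `products` and `numerator` are illustrative; the data is the same five houses.

```python
sqft = [1000, 1200, 1500, 1800, 2200]
price = [100000, 150000, 200000, 250000, 300000]

x_bar = sum(sqft) / len(sqft)
y_bar = sum(price) / len(price)

# One product per house: (x_i - x_bar) * (y_i - y_bar).
products = [(x - x_bar) * (y - y_bar) for x, y in zip(sqft, price)]

# The numerator of the slope formula is the sum of those products.
numerator = sum(products)
print(numerator)  # 150000000.0
```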

Let's calculate the denominator:

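$$
\begin{aligned}
i = 1:&\quad (1000 - 1540)^2 = 291600 \\
i = 2:&\quad (1200 - 1540)^2 = 115600 \\
i = 3:&\quad (1500 - 1540)^2 = 1600 \\
i = 4:&\quad (1800 - 1540)^2 = 67600 \\
i = 5:&\quad (2200 - 1540)^2 = 435600
\end{aligned}
$$

We then add them up:
$$291600 + 115600 + 1600 + 67600 + 435600 = 912000$$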
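The denominator follows the same pattern, except each deviation is squared instead of multiplied by a price deviation. Again, this is only a sketch with illustrative names.

```python
sqft = [1000, 1200, 1500, 1800, 2200]
x_bar = sum(sqft) / len(sqft)

# One squared deviation per house: (x_i - x_bar) ** 2.
squared_deviations = [(x - x_bar) ** 2 for x in sqft]

# The denominator of the slope formula is their sum.
denominator = sum(squared_deviations)
print(squared_deviations)  # [291600.0, 115600.0, 1600.0, 67600.0, 435600.0]
print(denominator)         # 912000.0
```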

Perfect! Let's substitute both sums into the formula:
$$b_1 = \frac{150000000}{912000} \approx 164.5$$

💡
Basically, all this does is determine how much the predicted dependent variable (y-hat) changes when we increase the independent variable (x) by 1. In our case, for every 1 sqft added, the price will increase by about $164.5.
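Putting the numerator and denominator together, the whole slope calculation fits in one short function. The function name `slope` is an assumption of this sketch; note that the unrounded value is about 164.47, which rounds to the 164.5 used in this post.

```python
def slope(xs, ys):
    """Closed-form slope b1: summed cross-deviations over summed squared x-deviations."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    return numerator / denominator


sqft = [1000, 1200, 1500, 1800, 2200]
price = [100000, 150000, 200000, 250000, 300000]

b1 = slope(sqft, price)
print(round(b1, 1))  # 164.5
```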

Calculating the y-intercept

Great, now let's solve for b0:
$$b_0 = \bar{y} - b_1 \bar{x}$$

We previously computed b1 = 164.5, ȳ = 200000, and x̄ = 1540, so we can easily substitute:
$$b_0 = 200000 - 164.5(1540) = -53330$$
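Here is the intercept step as a quick sketch. It computes b0 twice: once with the rounded slope of 164.5, and once with the unrounded slope, to show how much the rounding matters. The variable names are illustrative. Either way the intercept is negative, which simply means the fitted line crosses the y-axis below zero; it is not a meaningful price for a 0 sqft house.

```python
x_bar = 1540    # average square footage computed earlier
y_bar = 200000  # average price computed earlier

# Intercept using the slope rounded to one decimal place.
b1_rounded = 164.5
b0_rounded = y_bar - b1_rounded * x_bar
print(b0_rounded)  # -53330.0

# Intercept using the unrounded slope (numerator / denominator).
b1_exact = 150000000 / 912000
b0_exact = y_bar - b1_exact * x_bar
print(round(b0_exact, 2))  # -53289.47
```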

Predicting the price

Now that we have calculated all the values, we can apply the simple linear regression formula to predict our new house price:

$$\hat{y} = b_0 + b_1 x + 0$$
$$\text{price} = -53330 + 164.5(1300) = 160520$$

Let's see what our table looks like with the prediction (look at house #3):

| House # | Square Footage (sqft) | Price ($) |
|---------|-----------------------|-----------|
| 1       | 1000                  | 100,000   |
| 2       | 1200                  | 150,000   |
| 3       | 1300                  | 160,520   |
| 4       | 1500                  | 200,000   |
| 5       | 1800                  | 250,000   |
| 6       | 2200                  | 300,000   |

For a new 1300 sqft house, and based on the data previously available, our simple regression algorithm predicted a reasonable price of $160,520.
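The whole procedure, from raw data to prediction, can be sketched end to end as follows. The function name `fit_line` is hypothetical, and the code keeps the coefficients unrounded, so its result (about 160,526) can differ a little from a hand calculation that uses the rounded slope.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line through the points (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    return b0, b1


sqft = [1000, 1200, 1500, 1800, 2200]
price = [100000, 150000, 200000, 250000, 300000]

b0, b1 = fit_line(sqft, price)

# Predict the price of the new 1300 sqft house (error term assumed to be zero).
predicted = b0 + b1 * 1300
print(round(predicted, 2))  # 160526.32
```

Keeping the fit in its own function makes it easy to reuse the same coefficients for any other square footage.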

Conclusion

That's all there is to it. In this post, we explored the closed-form solution because it provides a straightforward way to find the best-fit line in linear regression. Keep in mind, however, that this method is not applicable to non-linear problems and becomes inefficient for large datasets. In such cases, iterative and other methods are preferred, as they can handle the complexity and computational challenges more effectively.

We looked at predicting the outcome given a single variable. In the real world, however, you will most likely work with an extension of simple linear regression that predicts an outcome given more than a single variable.

It is called Multiple Linear Regression, and I discussed it in this post. If you'd like to see how Simple and Multiple Linear Regression can be implemented in Python, please check out this post.

Thanks for reading!