An introduction to linear regression for machine learning
In this post, I will go over the concept of simple linear regression, delve into the underlying mathematical principles of the algorithm, and explore its practical application in the field of machine learning.
What is Linear Regression?
Linear regression is a type of regression analysis used to make predictions based on labeled data. In this post, we're going to focus on the simplest application of the linear regression algorithm which is referred to as Simple Linear Regression, or simply Linear Regression.
Use Cases
- House Price Prediction: Linear regression can be used to predict the price of a house based on its square footage (sqft).
- Note: The predicted price depends on the value of the square footage (sqft); therefore, we can say that price is the dependent variable and square footage is the independent variable.
- Sales Forecasting: Companies can predict marketing return on investment (ROI) based on previous advertising spend.
- Note: The predicted return on investment depends on the value of the advertising spend; therefore, we can say that the return on investment is the dependent variable and advertising spend is the independent variable.
- Medical research: Linear regression is often used to predict disease risk in patients given their age.
- Can you guess the independent variable in this case?
Making Predictions
Let's take the first example and see how we can apply linear regression to predict a house price. Here is a table of house prices and their square footage (sqft):
House # | Square Footage (sqft) | Price ($) |
---|---|---|
1 | 1000 | 100000 |
2 | 1200 | 150000 |
3 | 1500 | 200000 |
4 | 1800 | 250000 |
5 | 2200 | 300000 |
We're asked if we can predict the price of a 1300 sqft house given the data we already have. Could we use linear regression in this case?
Of course. Let's add the new house to our table:
House # | Square Footage (sqft) | Price ($) |
---|---|---|
... | ... | ... |
6 | 1300 | ?? |
Simple Linear Regression Formula
$$ ŷ = b_0 + b_1 x + e $$
We're going to break this down to understand how it works; then we'll compute our predicted house price ŷ.
- ŷ (y-hat) is the predicted house price (dependent variable)
- x is our independent variable, 1300 (sqft)
- b0 is the intercept
- It is the value of ŷ when x = 0
- b1 is the slope
- It tells us how much ŷ will change for every 1 unit change in x
- e is the error term; we'll cover it in another post, so for now let's assume it's zero.
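With the error term set to zero, the formula above can be sketched as a tiny Python helper. The names `predict`, `b0`, and `b1` are just illustrative:

```python
def predict(x, b0, b1):
    """Return the predicted value ŷ = b0 + b1·x (error term assumed zero)."""
    return b0 + b1 * x
```

Once we've found values for b0 and b1, predicting a price is a single call to this function.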
Calculating the slope
We know x is 1,300, but what about b0 and b1? We'll need them to predict the house price (ŷ).
Let's start by taking a closer look at b1:
$$b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$$
I know, I know. It looks ugly and complicated. But, you'll see that it's quite simple as you read further.
- xi is the value of x for each ith observation (or house #), which in our case is:
- For the first house in the table, x is 1000 (x1)
- For the second house in the table, x is 1200 (x2)
- For the third house in the table, x is 1500 (x3)
- For the fourth house in the table, x is 1800 (x4)
- For the fifth house in the table, x is 2200 (x5)
- x̄ (x-bar) is the average of the square footage values. It is calculated by dividing the sum of the square footage values by the total number of houses:
$$ \frac{1000 + 1200 + 1500 + 1800 + 2200}{5} = 1540 $$
- yi is the value of y for each ith observation, which in our case means:
- For the first house in the table, y is 100000 (y1)
- For the second house in the table, y is 150000 (y2)
- For the third house in the table, y is 200000 (y3)
- For the fourth house in the table, y is 250000 (y4)
- For the fifth house in the table, y is 300000 (y5)
- ȳ (y-bar) is the average of the price values (our dependent variable). It is calculated by dividing the sum of the prices by the total number of houses:
$$ \frac{100000 + 150000 + 200000 + 250000 + 300000}{5} = 200000 $$
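The two means can be checked with a few lines of plain Python (the variable names are my own):

```python
# Square footage (x) and prices (y) from the table above
sqft = [1000, 1200, 1500, 1800, 2200]
price = [100000, 150000, 200000, 250000, 300000]

x_bar = sum(sqft) / len(sqft)     # mean square footage
y_bar = sum(price) / len(price)   # mean price

print(x_bar, y_bar)  # 1540.0 200000.0
```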
We now have everything we need to plug into the formula and solve it. We'll start with the numerator:
$$\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$$
$$ i_1 = (1000 - 1540) \times (100000 - 200000) = 54000000 $$
$$ i_2 = (1200 - 1540) \times (150000 - 200000) = 17000000 $$
$$ i_3 = (1500 - 1540) \times (200000 - 200000) = 0 $$
$$ i_4 = (1800 - 1540) \times (250000 - 200000) = 13000000 $$
$$ i_5 = (2200 - 1540) \times (300000 - 200000) = 66000000 $$
We then add them up:
$$ 54000000 + 17000000 + 0 + 13000000 + 66000000 = 150000000 $$
Let's calculate the denominator:
$$\sum_{i=1}^n (x_i - \bar{x})^2$$
$$ i_1 = (1000-1540)^2 = 291600 $$
$$ i_2 = (1200-1540)^2 = 115600 $$
$$ i_3 = (1500-1540)^2 = 1600 $$
$$ i_4 = (1800-1540)^2 = 67600 $$
$$ i_5 = (2200-1540)^2 = 435600 $$
We then add them up:
$$ 291600 + 115600 + 1600 + 67600 + 435600 = 912000 $$
Perfect! Let's substitute:
$$b_1 = \frac{150000000}{912000} \approx 164.5 $$
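The same slope calculation in plain Python, using the data and means from above (variable names are illustrative):

```python
sqft = [1000, 1200, 1500, 1800, 2200]
price = [100000, 150000, 200000, 250000, 300000]
x_bar = sum(sqft) / len(sqft)
y_bar = sum(price) / len(price)

# Numerator: sum of (x_i - x̄)(y_i - ȳ)
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
# Denominator: sum of (x_i - x̄)²
denominator = sum((x - x_bar) ** 2 for x in sqft)

b1 = numerator / denominator
print(numerator, denominator, round(b1, 2))  # 150000000.0 912000.0 164.47
```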
Calculating the y-intercept
Great, now let's solve for b0
$$b_0 = \bar{y} - b_1(\bar{x})$$
We previously computed b1 ≈ 164.5, ȳ = 200000, and x̄ = 1540, so we can simply substitute:
$$b_0 = 200000 - 164.5(1540) = -53330$$
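In code, using the unrounded slope from the previous step, the intercept comes out slightly different (≈ -53,289) because our hand calculation rounded b1 to 164.5; both values are fine for our purposes:

```python
b1 = 150_000_000 / 912_000   # unrounded slope from the previous step
x_bar, y_bar = 1540, 200000  # means computed earlier

b0 = y_bar - b1 * x_bar
print(round(b0, 2))  # -53289.47
```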
Predicting the price
Now, since we calculated all values, we can go ahead and apply the simple linear regression formula to predict our new house price:
$$ ŷ = b_0 + b_1 x + 0 $$
$$ price = -53,330 + 164.5 (1,300) = 160,520 $$
Let's see what our table looks like with the prediction (look at house #3):
House # | Square Footage (sqft) | Price ($) |
---|---|---|
1 | 1000 | 100,000 |
2 | 1200 | 150,000 |
3 | 1300 | 160,520 |
4 | 1500 | 200,000 |
5 | 1800 | 250,000 |
6 | 2200 | 300,000 |
For a new 1300 sqft house, and based on the data we already had, our simple linear regression model predicted a reasonable price of $160,520.
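As a sanity check, here is the whole calculation end to end in plain Python. Keeping the slope unrounded gives a prediction of about $160,526, which matches the hand calculation to within rounding error:

```python
sqft = [1000, 1200, 1500, 1800, 2200]
price = [100000, 150000, 200000, 250000, 300000]

# Means of the independent (x) and dependent (y) variables
x_bar = sum(sqft) / len(sqft)
y_bar = sum(price) / len(price)

# Slope b1 and intercept b0, as derived above
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price)) \
     / sum((x - x_bar) ** 2 for x in sqft)
b0 = y_bar - b1 * x_bar

# Predicted price for the new 1300 sqft house
predicted = b0 + b1 * 1300
print(round(predicted, 2))  # 160526.32
```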
Conclusion
That's all there is to it. In this post, we explored the closed-form solution because it provides a straightforward way to find the best-fit line in linear regression. Keep in mind, however, that this method is neither applicable nor efficient for non-linear problems or large datasets. In such cases, iterative methods such as gradient descent are preferred because they handle the complexity and computational load more effectively.
We looked at predicting an outcome from a single variable; in the real world, however, you will most likely work with an extension of simple linear regression that predicts an outcome from more than one variable.
It is called Multiple Linear Regression, and I discussed it in this post. If you'd like to see how Simple and Multiple Linear Regression can be implemented in Python, please check out this post.
Thanks for reading!