An introduction to linear regression for machine learning
In this post, I will go over the concept of simple linear regression, delve into the underlying mathematical principles of the algorithm, and explore its practical application in the field of machine learning.
What is Linear Regression?
Linear regression is a type of regression analysis used to make predictions based on labeled data. In this post, we're going to focus on the simplest application of the linear regression algorithm which is referred to as Simple Linear Regression, or simply Linear Regression.
Use Cases
- House Price Prediction: Linear regression can be used to predict the price of a house based on its square footage (sqft).
- Note: The predicted price depends on the value of the square footage (sqft); therefore, we can say that price is the dependent variable and square footage is the independent variable.
- Sales Forecasting: Companies can predict marketing return on investment (ROI) based on previous advertising spend.
- Note: The predicted return on investment depends on the value of the advertising spend; therefore, we can say that the return on investment is the dependent variable and advertising spend is the independent variable.
- Medical research: Linear regression is often used to predict disease risk in patients given their age.
- Can you guess the independent variable in this case?
Making Predictions
Let's take the first example and see how we can apply linear regression to predict a house price. Here is a table of house prices and their square footage (sqft):
House # | Square Footage (sqft) | Price ($) |
---|---|---|
1 | 1000 | 100000 |
2 | 1200 | 150000 |
3 | 1500 | 200000 |
4 | 1800 | 250000 |
5 | 2200 | 300000 |
We're asked if we can predict the price of a 1300 sqft house given the data we already have. Could we use linear regression in this case?
Of course. Let's add the new house to our table:
House # | Square Footage (sqft) | Price ($) |
---|---|---|
... | ... | ... |
6 | 1300 | ?? |
Simple Linear Regression Formula
$$ ŷ = b_0 + b_1 x + e $$
We're going to break this down to understand how it works; then we'll compute our predicted house price ŷ.
- ŷ (y-hat) is the predicted house price (dependent variable)
- x is our independent variable, 1300 (sqft)
- b0 is the intercept
- It is the value of ŷ when x = 0
- b1 is the slope
- It tells us how much ŷ will change for every 1 unit change in x
- e is the error term; we'll cover it in another post, so for now let's assume it's zero.
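With the error term set to zero, the formula above can be sketched as a tiny Python helper. The names `predict`, `b0`, and `b1` are just illustrative:

```python
def predict(x, b0, b1):
    """Return the predicted value ŷ = b0 + b1·x (error term assumed zero)."""
    return b0 + b1 * x
```

Once we've found values for b0 and b1, predicting a price is a single call to this function.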
Calculating the slope
We know x is 1,300, but what about b0 and b1? We'll need them to predict the house price (ŷ).
Let's start by taking a closer look at b1:
$$b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$$
I know, I know. It looks ugly and complicated. But, you'll see that it's quite simple as you read further.
- xi is the value of x for each ith observation (or house #), which in our case is:
- For the first house in the table, x is 1000 (x1)
- For the second house in the table, x is 1200 (x2)
- For the third house in the table, x is 1500 (x3)
- For the fourth house in the table, x is 1800 (x4)
- For the fifth house in the table, x is 2200 (x5)
- x̄ (x-bar) is the average of the square footage values. It is calculated by dividing the sum of the square footage values by the total number of houses:
$$ \frac{1000 + 1200 + 1500 + 1800 + 2200}{5} = 1540 $$
- yi is the value of y for each ith observation, which in our case means:
- For the first house in the table, y is 100000 (y1)
- For the second house in the table, y is 150000 (y2)
- For the third house in the table, y is 200000 (y3)
- For the fourth house in the table, y is 250000 (y4)
- For the fifth house in the table, y is 300000 (y5)
- ȳ (y-bar) is the average of the price values (our dependent variable). It is calculated by dividing the sum of the prices by the total number of houses:
$$ \frac{100000 + 150000 + 200000 + 250000 + 300000}{5} = 200000 $$
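The two means can be checked with a few lines of plain Python (the variable names are my own):

```python
# Square footage (x) and prices (y) from the table above
sqft = [1000, 1200, 1500, 1800, 2200]
price = [100000, 150000, 200000, 250000, 300000]

x_bar = sum(sqft) / len(sqft)     # mean square footage
y_bar = sum(price) / len(price)   # mean price

print(x_bar, y_bar)  # 1540.0 200000.0
```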
We now have everything we need to plug into the formula and solve it. We'll start with the numerator:
$$\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$$
$$ i_1 = (1000 - 1540) \times (100000 - 200000) = 54000000 $$
$$ i_2 = (1200 - 1540) \times (150000 - 200000) = 17000000 $$
$$ i_3 = (1500 - 1540) \times (200000 - 200000) = 0 $$
$$ i_4 = (1800 - 1540) \times (250000 - 200000) = 13000000 $$
$$ i_5 = (2200 - 1540) \times (300000 - 200000) = 66000000 $$
We then add them up:
$$ 54000000 + 17000000 + 0 + 13000000 + 66000000 = 150000000 $$
Let's calculate the denominator:
$$\sum_{i=1}^n (x_i - \bar{x})^2$$
$$ i_1 = (1000-1540)^2 = 291600 $$
$$ i_2 = (1200-1540)^2 = 115600 $$
$$ i_3 = (1500-1540)^2 = 1600 $$
$$ i_4 = (1800-1540)^2 = 67600 $$
$$ i_5 = (2200-1540)^2 = 435600 $$
We then add them up:
$$ 291600 + 115600 + 1600 + 67600 + 435600 = 912000 $$
Perfect! Let's substitute:
$$b_1 = \frac{150000000}{912000} \approx 164.5 $$
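The same slope calculation in plain Python, using the data and means from above (variable names are illustrative):

```python
sqft = [1000, 1200, 1500, 1800, 2200]
price = [100000, 150000, 200000, 250000, 300000]
x_bar = sum(sqft) / len(sqft)
y_bar = sum(price) / len(price)

# Numerator: sum of (x_i - x̄)(y_i - ȳ)
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))
# Denominator: sum of (x_i - x̄)²
denominator = sum((x - x_bar) ** 2 for x in sqft)

b1 = numerator / denominator
print(numerator, denominator, round(b1, 2))  # 150000000.0 912000.0 164.47
```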
Calculating the y-intercept
Great, now let's solve for b0
$$b_0 = \bar{y} - b_1(\bar{x})$$
We previously computed b1 ≈ 164.5, ȳ = 200000, and x̄ = 1540, so we can simply substitute:
$$b_0 = 200000 - 164.5(1540) = -53330$$
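In code, using the unrounded slope from the previous step, the intercept comes out slightly different (≈ -53,289) because our hand calculation rounded b1 to 164.5; both values are fine for our purposes:

```python
b1 = 150_000_000 / 912_000   # unrounded slope from the previous step
x_bar, y_bar = 1540, 200000  # means computed earlier

b0 = y_bar - b1 * x_bar
print(round(b0, 2))  # -53289.47
```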
Predicting the price
Now, since we calculated all values, we can go ahead and apply the simple linear regression formula to predict our new house price:
$$ ŷ = b_0 + b_1 x + 0 $$
$$ price = -53,330 + 164.5 (1,300) = 160,520 $$
Let's see what our table looks like with the prediction (look at house #3):
House # | Square Footage (sqft) | Price ($) |
---|---|---|
1 | 1000 | 100,000 |
2 | 1200 | 150,000 |
3 | 1300 | 160,520 |
4 | 1500 | 200,000 |
5 | 1800 | 250,000 |
6 | 2200 | 300,000 |
For a new 1300 sqft house, and based on the data we already had, our simple linear regression model predicted a reasonable price of $160,520.
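As a sanity check, here is the whole calculation end to end in plain Python. Keeping the slope unrounded gives a prediction of about $160,526, which matches the hand calculation to within rounding error:

```python
sqft = [1000, 1200, 1500, 1800, 2200]
price = [100000, 150000, 200000, 250000, 300000]

# Means of the independent (x) and dependent (y) variables
x_bar = sum(sqft) / len(sqft)
y_bar = sum(price) / len(price)

# Slope b1 and intercept b0, as derived above
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price)) \
     / sum((x - x_bar) ** 2 for x in sqft)
b0 = y_bar - b1 * x_bar

# Predicted price for the new 1300 sqft house
predicted = b0 + b1 * 1300
print(round(predicted, 2))  # 160526.32
```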
Conclusion
That's all there is to it. In this post, we explored the closed-form solution because it provides a straightforward way to find the best-fit line in linear regression. Keep in mind, however, that this method is neither applicable nor efficient for non-linear problems or large datasets. In such cases, iterative methods such as gradient descent are preferred because they handle the complexity and computational load more effectively.
We looked at predicting an outcome from a single variable; in the real world, however, you will most likely work with an extension of simple linear regression that predicts an outcome from more than one variable.
It is called Multiple Linear Regression, and I discussed it in this post. If you'd like to see how Simple and Multiple Linear Regression can be implemented in Python, please check out this post.
Thanks for reading!