Regression With Scikit Learn (Part 2)
Published: Aug 17, 2021
Last updated: Aug 17, 2021
This is Day 30 of the #100DaysOfPython challenge.
This post continues from part one. It breaks down the basics of linear regression, then builds on that earlier work by applying a train-test split to our dataset.
Source code can be found on my GitHub repo okeeffed/regression-with-scikit-learn-part-two.
Prerequisites
- Familiarity with the Conda package, dependency and virtual environment manager. A handy additional reference for Conda is the blog post "The Definitive Guide to Conda Environments" on "Towards Data Science".
- Familiarity with JupyterLab. See here for my post on JupyterLab.
- These projects will also run Python notebooks on VSCode with the Jupyter Notebooks extension. If you do not use VSCode, it is expected that you know how to run notebooks (or alter the method for what works best for you).
- Read "Regression With Scikit Learn (Part One)"
Getting started
Let's create the `regression-with-scikit-learn-part-two` directory by cloning the work we did yesterday. The packages required will be available in our `conda` environment.
If you are unsure how to activate the `conda` virtual environment, please refer to the prerequisites or resources section for links on `conda` fundamentals.

    # Make the `regression-with-scikit-learn-part-two` directory
    $ git clone https://github.com/okeeffed/regression-with-scikit-learn.git regression-with-scikit-learn-part-two
    $ cd regression-with-scikit-learn-part-two

At this stage, the file `docs/linear_regression.ipynb` already exists and we can work off this material.
Before we start, let's go over the basics of linear regression.
Linear regression basics
The equation that describes a straight line is the following:

$$y = ax + b$$

The statement can be broken down into the following:
Variable/Statement | Description |
---|---|
y | Target variable |
x | Single feature |
a,b | Parameters of the model |
To calculate the values of `a` and `b`, we need to define an error function (also known as the cost function or loss function) for any line and choose the line that minimizes the error function.
The aim is to minimize the vertical distance between the fitted line and each data point.
The distance itself is known as the residual. Because positive and negative residuals (from data points above and below the line) would cancel each other out, we use the sum of the squares of the residuals.
This will be our loss function, and the method of minimizing it is called Ordinary Least Squares (OLS).
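To make the cancellation problem concrete, here is a minimal sketch in plain Python. The three data points and the helper names `residuals` and `sum_squared_residuals` are made up for illustration and are not part of the notebook. Both candidate lines have raw residuals that sum to zero, so only the squared loss can tell them apart.

```python
# Toy data (made up for illustration): three points scattered around a line.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 3.0]


def residuals(a, b, xs, ys):
    """Vertical distance from each data point to the line y = a*x + b."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]


def sum_squared_residuals(a, b, xs, ys):
    """The OLS loss: the sum of the squared residuals."""
    return sum(r ** 2 for r in residuals(a, b, xs, ys))


# A flat line at y = 3: the raw residuals are -1, 1 and 0, which cancel out.
flat_line = residuals(0.0, 3.0, xs, ys)
print(sum(flat_line))

# The squared loss does not cancel, so it can rank the two candidate lines.
print(sum_squared_residuals(0.0, 3.0, xs, ys))  # flat line
print(sum_squared_residuals(0.5, 2.0, xs, ys))  # sloped line, lower loss
```

The line with the smaller squared loss is the better fit under OLS.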
Wikipedia describes OLS as the following:
OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being observed) in the given dataset and those predicted by the linear function of the independent variable.
Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface — the smaller the differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula, especially in the case of a simple linear regression, in which there is a single regressor on the right side of the regression equation.
To put that into some human speak, the axis of the dependent variable is our `y` axis, so for each data point we take the vertical distance to the point on the regression line with the same `x` value, square it, and sum the results. The smaller the sum, the better the fit.
When we call the `fit` method on our `LinearRegression` object, we are actually calculating the parameters of the line by performing OLS under the hood.
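For a single feature, the OLS parameters have a well-known closed-form solution, sketched below in plain Python. The function name `fit_simple_ols` and the toy data are assumptions for illustration. scikit-learn uses a more general numerical solver rather than this exact formula, but for one feature it should arrive at the same line.

```python
def fit_simple_ols(xs, ys):
    """Return (a, b) minimizing the sum of squared residuals for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    a = s_xy / s_xx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - a * mean_x
    return a, b


# Toy data generated exactly from y = 2x + 1, so we know what to expect.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

a, b = fit_simple_ols(xs, ys)
print(a, b)
```

Fitting `LinearRegression` on the same data should report matching values in its `coef_` and `intercept_` attributes.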
Higher dimensions of linear regression
So far, the examples we have done work in a dimension that is easily understood, with `y` being calculated from one feature on the x-axis (in yesterday's example, this was "Number Of Rooms" (feature) vs "Value Of House" (target variable)).
However, in the real world, we often have more than one feature.
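With multiple features (or dimensions), our linear regression equation becomes the following:

$$y = a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b$$

Here each feature $x_i$ has its own coefficient $a_i$, and $b$ is still the intercept.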
In practice, the scikit-learn API handles this for us, as we pass two arrays to the `fit` method:
- Array with the features.
- Array with the target variable.
Let's do just that and see how it works.
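As a warm-up before touching the housing data, here is a small sketch with two features. The arrays `X_toy` and `y_toy` are synthetic and hypothetical, generated exactly from `y = 3*x1 - 2*x2 + 5`, so the fitted parameters can be checked by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): 5 samples, 2 features per sample.
X_toy = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
# Target built exactly from y = 3*x1 - 2*x2 + 5 (no noise).
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + 5

model = LinearRegression()
model.fit(X_toy, y_toy)  # features array first, target array second

print(model.coef_)       # one coefficient per feature
print(model.intercept_)  # the b term
```

The features array has one row per sample and one column per feature, and `coef_` holds one value per column.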
Applying the train/test split to our dataset
In our file `docs/linear_regression.ipynb`, we can add the following:

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    # Hold out 30% of the rows for testing; random_state makes the split repeatable.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Fit on the training rows only, then score on the held-out test rows.
    reg_all = LinearRegression()
    reg_all.fit(X_train, y_train)
    y_pred = reg_all.predict(X_test)
    print(reg_all.score(X_test, y_test)) # outputs 0.711226005748496

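To see what the split itself does, here is a small sketch on stand-in data. The arrays `X_demo` and `y_demo` are hypothetical and only exist to make the row counts and row alignment easy to inspect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 samples with 2 features each.
# Row i of X_demo is [2*i, 2*i + 1] and its target is i.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# 30% of 10 rows go to the test set, the rest to the training set.
print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)

# Rows are shuffled, but each feature row stays paired with its target.
print(np.array_equal(X_tr[:, 0], 2 * y_tr))
```

Scoring on rows the model never saw during `fit` gives a more honest picture of how it might perform on new data.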
The default `score` method for linear regression returns R squared (the coefficient of determination). For more details, see the documentation.
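R squared compares the model's squared error with the squared error of always predicting the mean of the target. A minimal plain-Python sketch, with a hypothetical helper `r_squared` and made-up values:

```python
def r_squared(y_true, y_pred):
    """R squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r_squared(y_true, y_true)                   # predictions match exactly
baseline = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])    # always predict the mean
partial = r_squared(y_true, [4.0, 4.0, 8.0, 8.0])     # each prediction off by 1

print(perfect, baseline, partial)
```

A score of 1 means a perfect fit, 0 means no better than predicting the mean, and the score can go negative for a model that does worse than that baseline.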
Note: In practice, you will rarely use linear regression out of the box like this. You will almost always want to use regularization. We will dive into this in the next part.
Summary
Today's post covered the math that describes the line produced by the linear regression fit.
We then looked at how this calculation extends when more features are added into the mix.
Finally, we demonstrated this with a `train_test_split` and a `LinearRegression` object.
As noted in the last section, this is not how you would use Linear Regression in practice. You will (almost) always want to use regularization.
This will be our topic in tomorrow's post.
Resources and further reading
Photo credit: deepakrautela