First Look At Supervised Learning With Classification
Published: Aug 14, 2021
Last updated: Aug 14, 2021
This is Day 27 of the #100DaysOfPython challenge.
This post will look at setting up our template repository for scikit-learn with Miniconda (a minimal installer for conda).
We will do this by using the scikit-learn package to create a GitHub template repository for our Machine Learning projects.
The final code can be found at okeeffed/supervised-learning-with-scikit-learn-template.
Prerequisites
- Familiarity with the Conda package, dependency and virtual environment manager. A handy additional reference for Conda is the blog post "The Definitive Guide to Conda Environments" on "Towards Data Science".
- Familiarity with JupyterLab. See here for my post on JupyterLab.
- These projects will also run Python notebooks on VSCode with the Jupyter Notebooks extension. If you do not use VSCode, it is expected that you know how to run notebooks (or alter the method for what works best for you).
Getting started
Let's create the `supervised-learning-with-scikit-learn-template` directory and install the required packages.
    # Make the `supervised-learning-with-scikit-learn-template` directory
    $ mkdir supervised-learning-with-scikit-learn-template
    $ cd supervised-learning-with-scikit-learn-template

    # Create a docs folder to place our notebook
    $ mkdir docs
    $ touch docs/supervised-learning-with-scikit-learn-template.ipynb

    # Install our required dependencies
    $ conda install scikit-learn pandas numpy matplotlib ipykernel
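Before opening the notebook, it can be worth confirming that the install worked. The snippet below is a minimal sketch of such a check and is not part of the original template: it imports the four libraries the notebook uses and prints whatever version each one reports. The `packages` list and `versions` dictionary are names made up for this illustration.

```python
# Sanity check: confirm the libraries used in the notebook can be imported.
import importlib

# Import names for the installed packages
# (note that scikit-learn is imported as `sklearn`).
packages = ["sklearn", "pandas", "numpy", "matplotlib"]

versions = {}
for name in packages:
    module = importlib.import_module(name)
    # Fall back to "unknown" if a module does not expose __version__.
    versions[name] = getattr(module, "__version__", "unknown")

for name, version in versions.items():
    print(f"{name}: {version}")
```

If any import fails here, the conda environment is likely not the one that is currently active.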
At this stage, we are ready to take a first look at some of the packages we will be using over the upcoming posts.
There will be more in-depth posts over the coming days with each package.
Today will include a short look at the `iris` dataset provided by scikit-learn.
With this in mind, we can now begin adding code to our notebook.
Writing our first notebook
We will write seven cells in the notebook:
- Importing our required packages and setting a graph style.
- Exploring the `iris` dataset.
- Assigning the iris dataset to the `X` and `y` variables.
- Creating and exploring the data frame.
- Visualizing the output and making sense of the data.
- Creating a k-nearest neighbors classifier.
- Applying the classifier to some unlabelled data and assigning predicted classes to that data.
Importing required packages
In our file `docs/supervised-learning-with-scikit-learn-template.ipynb`, we can add the following:

    # Importing our required libraries
    from sklearn import datasets
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    plt.style.use('ggplot')
We are using four main libraries here:
- `sklearn`, which includes simple and efficient tools for predictive data analysis.
- `pandas`, a data analysis and manipulation tool.
- `numpy`, to help with scientific computing.
- `matplotlib`, as our data visualization library.
Finally, we are updating the `pyplot` style to use `ggplot` for aesthetics. More on that can be found in the docs here.
Exploring the dataset
As a first look, we will explore the dataset with some helpful functions to get a better idea of what is happening.
    # Exploring the Iris dataset
    iris = datasets.load_iris()

    type(iris)
    # A Bunch (the exact module path varies by scikit-learn version)
    # - a dictionary-like object with key-value pairs

    print(iris.keys())
    # dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
    # (newer scikit-learn versions include extra keys such as 'frame' and 'filename')

    print(iris.feature_names)
    # ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

    type(iris.data)    # numpy.ndarray
    type(iris.target)  # numpy.ndarray

    iris.data.shape
    # (150, 4) - 150 rows and 4 columns

    iris.target_names
    # array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
    # these will be encoded as 0, 1, 2
Some things to take away:
- `iris.data` holds the features of the data (also known as independent or predictor variables). There are 4 features (4 columns) in the data.
- The features themselves can be explored with the `feature_names` property. In this data, the features are `sepal length (cm)`, `sepal width (cm)`, `petal length (cm)` and `petal width (cm)`.
- We notice that the `target` is a vector of integers. Our three possible classes of `setosa`, `versicolor` and `virginica` will be encoded as 0, 1, 2.
- The `iris.data.shape` tells us that there are 150 rows of data to use as historical data to help us find features which might be useful in identifying future entries.
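To make the integer encoding concrete, the short sketch below (an addition for illustration, not a cell from the original notebook) counts how many rows belong to each encoded class and pairs each integer code with its species name. It uses `np.bincount`, which counts occurrences of each non-negative integer in an array.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Count how many rows belong to each encoded class (0, 1, 2).
counts = np.bincount(iris.target)

# Pair each integer code with its species name and row count.
for code, (name, count) in enumerate(zip(iris.target_names, counts)):
    print(f"{code} -> {name}: {count} rows")
```

The dataset is balanced, so each of the three species should account for 50 of the 150 rows.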
Assigning the iris dataset to a variable
The next step is to assign the data to more conventional variable names.
Our features are assigned to `X` while the target variables are assigned to `y`.

    # Setting our features to X and our target variables to y
    X = iris.data
    y = iris.target
Creating and exploring the data frame
We use the `X` feature matrix to create a data frame.
The data frame is a tabular data structure with rows and columns.
    # Create the dataframe
    df = pd.DataFrame(X, columns=iris.feature_names)
    print(df.head())  # print the first 5 rows
Calling the `.head()` method on a data frame allows us to explore the first 5 entries of the data frame in the tabular structure.
The output is as follows:
       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
    0                5.1               3.5                1.4               0.2
    1                4.9               3.0                1.4               0.2
    2                4.7               3.2                1.3               0.2
    3                4.6               3.1                1.5               0.2
    4                5.0               3.6                1.4               0.2
This is a helpful preview to understand how our data will be used in the final matrix.
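The data frame above only holds the features. As an optional extra step that is not in the original notebook, the sketch below attaches a hypothetical `species` column decoded from the integer targets and then prints the mean of each feature per species. `pd.Categorical.from_codes` maps integer codes onto a list of category names.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Hypothetical extra column (not part of the original notebook):
# decode the integer targets 0, 1, 2 into species names.
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Mean of each feature per species.
summary = df.groupby("species", observed=True).mean()
print(summary)
```

A per-species summary like this gives a rough numeric hint about which features separate the classes before any plotting is done.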
Visualizing the output
Finally, we can visualize the output by using a scatter matrix.
The matrix is a grid of scatter plots that shows the relationship between each pair of features. It allows us to explore many relationships in one chart.
    # Help visualize the data.
    # c stands for color, so we color the points by species.
    # figsize is the size of the figure.
    # s is the marker size and marker is the shape of the points.
    _ = pd.plotting.scatter_matrix(df, c=y, figsize=[8, 8], s=150, marker='D')

    # The diagonal plots are histograms of the feature for that row and column.
    # The remaining plots are scatter plots of the column feature vs the row feature,
    # colored by target variable.
    # We can see that petal width and petal length are highly correlated.
    plt.show()
In our notebook, this will output the following scatter matrix:
Scatter matrix in VSCode
It is up to us to interpret the data.
On the diagonal, we can see histograms that bucket the values of the feature corresponding to that row and column.
The colors on the scatter plots are assigned by our target variable. As we have three target classes, we get three different colors plotted out.
The rest are scatter plots of the column feature vs the row feature, colored by target variable.
Something that you will notice in the scatter plot second from the bottom in the right-hand column (petal length vs petal width) is that the points fall along a rough line. This tells us that there is a strong correlation between the two features.
You can read more about interpreting scatter plots here.
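The visual impression of correlation can also be checked numerically. The sketch below is an addition for illustration rather than a cell from the original notebook: it uses `DataFrame.corr()`, which computes the Pearson correlation between every pair of columns by default, and then pulls out the petal length and petal width pair.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Pearson correlation between every pair of features.
corr = df.corr()
print(corr.round(2))

# The pair highlighted in the scatter matrix.
petal_corr = corr.loc["petal length (cm)", "petal width (cm)"]
print(f"petal length vs petal width: {petal_corr:.2f}")
```

A value close to 1 for this pair would agree with the roughly linear grouping seen in the plot.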
Constructing a classifier
There are different algorithms for classifying data. In our example, we will be going with k-nearest neighbors, an algorithm that creates prediction boundaries to label a data point based on the classes of its `n` closest labelled data points.
We will do more of a deep dive on this classifier in another blog post. For now, we will see how to construct the classifier and train it against our labelled data.
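To give some intuition for what the classifier does, here is a simplified from-scratch sketch of the idea: measure the distance from a new point to every labelled row, take the closest few, and let them vote. The helper `knn_predict` is hypothetical and written only for this illustration. It is not how scikit-learn implements the classifier, and details such as tie-breaking may differ.

```python
from collections import Counter

import numpy as np
from sklearn import datasets


def knn_predict(X_train, y_train, point, k=6):
    """Label `point` by majority vote among its k nearest training rows."""
    # Euclidean distance from the query point to every training row.
    distances = np.linalg.norm(X_train - point, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common class among those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


iris = datasets.load_iris()

# A single flower's four measurements (taken from the dataset's first rows).
sample = np.array([4.7, 3.2, 1.3, 0.2])
label = knn_predict(iris.data, iris.target, sample)
print(iris.target_names[label])  # setosa
```

The sample here matches a setosa row already in the dataset, so its nearest neighbors are all setosa.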
    from sklearn.neighbors import KNeighborsClassifier

    # Set this to create boundaries based on the 6 closest neighbors.
    knn = KNeighborsClassifier(n_neighbors=6)
    knn.fit(iris.data, iris.target)
The `KNeighborsClassifier` docs are helpful for understanding more about the classifier and the available arguments.
In general, there are defaults for all possible arguments. Taken from the docs:
n_neighbors: int, default=5
    Number of neighbors to use by default for kneighbors queries.

weights: {`uniform`, `distance`} or callable, default=`uniform`
    Weight function used in prediction. Possible values:
    `uniform`: uniform weights. All points in each neighborhood are weighted equally.
    `distance`: weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
    [callable]: a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.

algorithm: {`auto`, `ball_tree`, `kd_tree`, `brute`}, default=`auto`
    Algorithm used to compute the nearest neighbors:
    `ball_tree` will use BallTree
    `kd_tree` will use KDTree
    `brute` will use a brute-force search.
    `auto` will attempt to decide the most appropriate algorithm based on the values passed to fit method.
    Note: fitting on sparse input will override the setting of this parameter, using brute force.

leaf_size: int, default=30
    Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

p: int, default=2
    Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

metric: str or callable, default=`minkowski`
    The distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric. See the documentation of DistanceMetric for a list of available metrics. If metric is "precomputed", X is assumed to be a distance matrix and must be square during fit. X may be a sparse graph, in which case only "nonzero" elements may be considered neighbors.

metric_params: dict, default=None
    Additional keyword arguments for the metric function.

n_jobs: int, default=None
    The number of parallel jobs to run for neighbors search. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details. Doesn't affect fit method.
Again, we will deep dive into this in another topic, but all you need to understand in our code is that we are overriding the default of `n_neighbors` to be `6` to make the prediction against the six closest neighbors.
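One way to see exactly which defaults are being overridden is to compare the parameters of a default classifier with ours. The sketch below uses `get_params()`, which scikit-learn estimators provide for inspecting their settings. The variable names are only for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# Parameters of a classifier built with no arguments vs. ours.
default_params = KNeighborsClassifier().get_params()
custom_params = KNeighborsClassifier(n_neighbors=6).get_params()

# Keep only the settings whose values differ: (default, ours).
changed = {
    name: (default_params[name], custom_params[name])
    for name in default_params
    if default_params[name] != custom_params[name]
}
print(changed)  # {'n_neighbors': (5, 6)}
```

Everything else, such as `weights` and `metric`, stays at the default values quoted from the docs.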
The `knn.fit(iris.data, iris.target)` invocation will train the classifier on the data. As soon as we have called `fit`, the classifier is ready to make predictions.
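The notebook trains on all 150 rows, so there is no labelled data left over to check how well the classifier does. As an optional aside that goes beyond the original notebook, the sketch below holds back 30% of the rows with `train_test_split`, fits on the remainder, and reports accuracy on the held-out rows via `score`. The split size and `random_state` are arbitrary choices made for this illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Hold back 30% of the rows, keeping the class proportions the same
# in both parts (stratify) and fixing the seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)

# Fraction of held-out rows that were labelled correctly.
accuracy = knn.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

The exact figure depends on the split, so treat it as a rough indication rather than a benchmark.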
Predicting unlabeled data
To make predictions, we need to call `predict` on the classifier and pass some unlabelled data.
We can use what we have already learned about data frames to display that data mapped to its features.
    # A set of unlabeled data.
    X_new = np.array([[5.6, 2.8, 3.9, 1.1],
                      [5.7, 2.6, 3.8, 1.3],
                      [4.7, 3.2, 1.3, 0.2]])
    X_new.shape
    # (3, 4) - 3 data points and 4 features

    # Showing the data frame
    df_new = pd.DataFrame(X_new, columns=iris.feature_names)
    print(df_new.head())  # print unlabeled data as a data frame
This will print the following:
       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
    0                5.6               2.8                3.9               1.1
    1                5.7               2.6                3.8               1.3
    2                4.7               3.2                1.3               0.2
Finally, we can apply what we have done to predict the class of the unlabeled data.
    prediction = knn.predict(X_new)
    print(prediction)
    # [1 1 0] - one integer label per row of X_new
    # This maps to [versicolor versicolor setosa]
Our prediction printed out `[1 1 0]`, which when decoded and mapped back to our labels results in the labels `[versicolor versicolor setosa]`.
Therefore, our classifier has predicted that the first and second data points are `versicolor` and that `setosa` is the class of the final data point.
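The decoding step can also be done in code rather than by eye. The sketch below repeats the setup so that it runs on its own, then indexes `iris.target_names` with the integer predictions to get species names. It also calls `predict_proba`, which with the default uniform weights reports the share of the six neighbors that voted for each class. This is an illustration added here, not a cell from the original notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(iris.data, iris.target)

X_new = np.array([[5.6, 2.8, 3.9, 1.1],
                  [5.7, 2.6, 3.8, 1.3],
                  [4.7, 3.2, 1.3, 0.2]])

prediction = knn.predict(X_new)

# Index target_names with the integer predictions to get species names.
predicted_names = iris.target_names[prediction]
print(predicted_names)  # ['versicolor' 'versicolor' 'setosa']

# One row per data point, one column per class (setosa, versicolor, virginica).
probabilities = knn.predict_proba(X_new)
print(probabilities)
```

Each row of `probabilities` sums to 1, and the column with the largest value corresponds to the predicted class.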
Summary
Today's post set up a starting repository for all future posts on Machine Learning.
We then wrote a Python notebook with cells that show how to load the iris dataset, how to explore and visualize the data, and how to create a classifier and apply it to unlabelled data.
Future posts will start to become more granular and dive deeper into particular topics around classifiers (and more machine learning applications).
Resources and further reading
Photo credit: itssammoqadam