Artificial Intelligence 🤖
Introduction

Machine Learning Introduction

Machine learning involves algorithms that learn from observational data to make predictions. Even techniques such as linear regression, where a line is fitted to data to make predictions, are considered machine learning. A crucial concept in machine learning is train/test, which evaluates the efficacy of a machine learning model. This concept becomes clear when understanding unsupervised and supervised learning.

Unsupervised Learning vs Supervised Learning

The main distinction between supervised and unsupervised learning is labeled data: to put it simply, supervised learning uses labeled input and output data, while an unsupervised learning algorithm does not.

Unsupervised Learning Example

Unsupervised Learning

Unsupervised learning doesn't provide the model with answers to learn from. Instead, it presents the data, and the algorithm attempts to make sense of it without external guidance. For instance, clustering without predefined categories requires the machine learning algorithm to infer those categories and their boundaries. The challenge is unpredictability: the resulting categorization depends entirely on the similarity metric the algorithm is given. Unsupervised learning can also unveil hidden classifications or latent variables.

Using unsupervised learning on people, such as for a dating site, can reveal clusters based on attributes that might defy conventional understanding. Similarly, clustering movies based on various factors like release date, running length, or country of release can yield unexpected insights. Analyzing product description texts can also uncover terms significant to certain categories.
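
As a minimal sketch of this idea (not taken from the original examples), scikit-learn's KMeans can cluster people described by two made-up attributes; the only guidance it receives is the number of clusters to look for:

import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)

# Hypothetical attributes for 100 people: age and income (purely synthetic)
age = np.random.normal(35.0, 10.0, 100)
income = np.random.normal(50000.0, 15000.0, 100)
people = np.column_stack((age, income))

# No labels are provided; KMeans infers the category boundaries on its own
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(people)

print(labels[:10])              # cluster assignment for the first 10 people
print(kmeans.cluster_centers_)  # the centroid of each discovered cluster

Changing the attributes, or scaling them differently, changes the clusters that emerge, which is exactly the unpredictability described above.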

Supervised Learning

Supervised learning provides the model with a set of answers from which to learn. It employs a training dataset to teach the model about relationships between features and desired outcomes, allowing predictions on new data.

For example, predicting car prices from car attributes would involve training the model using known car prices. Once trained, the model can predict prices for unseen cars.
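
As a hedged sketch of that idea (the cars and prices below are invented purely for illustration), a linear regression can be trained on known prices and then asked about a car it has never seen:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: [age in years, mileage in thousands of miles]
X_train = np.array([[1, 10], [3, 40], [5, 70], [8, 120], [10, 150]])
# The known prices are the "answers" the model learns from
y_train = np.array([28000, 21000, 16000, 9000, 6000])

model = LinearRegression()
model.fit(X_train, y_train)

# Predict the price of an unseen car: 4 years old with 55,000 miles
print(model.predict(np.array([[4, 55]])))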

Evaluating Supervised Learning

Supervised learning is evaluated using train/test. Observational data is divided into a training set (for model creation) and a testing set (for model validation). The model, built using the training data, is then evaluated against the testing data to ascertain its accuracy.

Metrics like r-squared or root-mean-square error quantify performance. These metrics aid in model comparison, tuning, and accuracy maximization. However, care must be taken with dataset size and representativeness: random sampling is vital so that ordering or hidden patterns in the data don't end up concentrated in one split, and vigilant monitoring is necessary to prevent overfitting.
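
A minimal sketch of this workflow with scikit-learn (the data here is synthetic); train_test_split shuffles the rows by default, which covers the random-sampling concern:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data: one feature with a noisy linear relationship to the target
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0.0, 1.0, 200)

# Shuffle and split: 80% for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

print('r-squared:', r2_score(y_test, predictions))
print('RMSE:', np.sqrt(mean_squared_error(y_test, predictions)))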

Misleading Results Example

K-fold Cross Validation

K-fold cross-validation mitigates the issues of train/test. In this method, we split the dataset into K subsets (known as folds), train the model on K-1 of them, and hold out the remaining fold for evaluating the trained model, repeating the process so that each fold takes a turn as the hold-out set. Average performance, typically the r-squared score, is then computed across all K iterations; a short scikit-learn sketch follows the steps below.

K-fold cross-validation steps:

  1. Split data into K randomly-assigned segments.
  2. Designate one segment as test data.
  3. Train using the remaining K-1 segments, measuring performance against the test set.
  4. Repeat for each of the K segments, holding out a different one each time, then average the resulting K r-squared scores.
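
A minimal sketch of those steps using scikit-learn's cross_val_score (the model and synthetic data stand in for whatever is actually being evaluated):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data with a noisy linear relationship
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0.0, 1.0, 200)

# 5-fold cross-validation: each fold takes a turn as the test set
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')

print(scores)         # one r-squared score per fold
print(scores.mean())  # the averaged performance estimate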

Using Train/Test to Prevent Overfitting in Polynomial Regression

In supervised machine learning, regression models are fundamental. Here, we focus on polynomial regression to determine the optimal degree for a given dataset. The dataset consists of randomly generated page speeds and purchase amounts, constructed with an intentional inverse relationship between them (purchase amount is divided by page speed):

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2)

# Page load times, centered around 3 seconds
pageSpeeds = np.random.normal(3.0, 1.0, 100)
# Purchase amount decreases as page load time increases (inverse relationship)
purchaseAmount = np.random.normal(50.0, 10.0, 100) / pageSpeeds

plt.scatter(pageSpeeds, purchaseAmount)
plt.show()

For the train/test approach, we allocate 80% of the data for training and reserve the remaining 20% for testing:

# First 80 points for training, the remaining 20 for testing
trainX = pageSpeeds[:80]
testX = pageSpeeds[80:]

trainY = purchaseAmount[:80]
testY = purchaseAmount[80:]

Although the data is split sequentially here, that is acceptable only because it was randomly generated to begin with; real-world datasets require a random shuffle before splitting. Scatter plots of the training set and of the test set each show the same overall inverse shape as the full dataset.
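
A minimal matplotlib sketch that could produce those two plots from the trainX/trainY and testX/testY arrays defined above:

import matplotlib.pyplot as plt

# Side-by-side scatter plots of the training and test subsets
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
axes[0].scatter(trainX, trainY)
axes[0].set_title('Training data (80 points)')
axes[1].scatter(testX, testY)
axes[1].set_title('Test data (20 points)')
plt.show()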

The choice of an 8th-degree polynomial is somewhat arbitrary; a deliberately high degree makes overfitting easy to observe:

x = np.array(trainX)
y = np.array(trainY)

# Fit an 8th-degree polynomial to the training data only
p4 = np.poly1d(np.polyfit(x, y, 8))

Plotting the polynomial against the training data reveals potential overfitting; tested against the test data, however, it suggests a reasonable, yet imperfect fit.
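
A minimal sketch of how that comparison could be plotted, reusing the fitted p4 model and the arrays defined above (the axis limits are chosen only to keep the curve's extremes from dominating the view):

import matplotlib.pyplot as plt
import numpy as np

# Smooth x values over which to draw the fitted curve
xp = np.linspace(0, 7, 100)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
axes[0].scatter(trainX, trainY)
axes[0].plot(xp, p4(xp), c='r')
axes[0].set_title('8th-degree fit vs. training data')
axes[1].scatter(testX, testY)
axes[1].plot(xp, p4(xp), c='r')
axes[1].set_title('8th-degree fit vs. test data')
axes[0].set_ylim(0, 200)
plt.show()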

The r-squared score, a measure of fit quality, is calculated using sklearn's r2_score:

from sklearn.metrics import r2_score

# Score the fitted polynomial on the held-out test data
r2 = r2_score(testY, p4(testX))

The score for the training data is 0.95, whereas for the test data it drops to -0.47. This significant drop highlights the overfitting problem. In fact, a negative coefficient of determination indicates that the model's predictions are worse than a constant function that always predicts the mean of the data.
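
That interpretation follows from the definition of the coefficient of determination: when the model's squared errors exceed those of the always-predict-the-mean baseline, the fraction exceeds one and the score turns negative.

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}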

While this example uses a simplistic approach to the train/test split, more efficient methods, including utilities from the pandas library and techniques like k-fold cross-validation, are explored later.


Resources: