
Linear Regression

Linear Regression Overview

Linear regression aims to fit a straight line to a set of observations, making it possible to predict new, unseen values. Essentially, it's about identifying a linear relationship between two variables. If we take a set of data points where Petal Width (on the $x$-axis) is plotted against Sepal Length (on the $y$-axis), a linear relationship might be evident.

Petal Width vs Sepal Length

The fundamental equation for a straight line is $y = mx + b$, where $m$ represents the slope and $b$ is the $y$-intercept. In our example, the computed slope is $0.89$ and the $y$-intercept is $4.78$.
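
For instance, plugging in a hypothetical petal width of 1.0 gives a predicted sepal length of $y = 0.89 \times 1.0 + 4.78 = 5.67$.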

Ordinary Least Squares (OLS)

OLS is the technique powering linear regression. It seeks to minimize the squared error between each data point and the regression line. Think of it as computing the variance, but relative to the line we're trying to define rather than the mean.

OLS Example

The equation for a straight line is $y = mx + c$. Given:

  1. $x_1, x_2, \ldots, x_n$ are the observed values of the independent variable.
  2. $y_1, y_2, \ldots, y_n$ are the observed values of the dependent variable.
  3. $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n$ are the predicted values of the dependent variable.
  4. The linear regression line is represented as $y = mx + c$.

The goal of OLS is to minimize the sum of the squared differences (errors) between the observed values ($y_i$) and the predicted values ($\hat{y}_i$). Mathematically, this is represented as:

$$E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Given the equation of the line:

$$\hat{y}_i = mx_i + c$$

The error term can be written as:

$$E = \sum_{i=1}^{n} \left(y_i - (mx_i + c)\right)^2$$

To find the best values for $m$ and $c$ that minimize the error $E$, we take the partial derivatives of $E$ with respect to $m$ and $c$ and set them equal to zero.
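
Setting those partial derivatives to zero gives a pair of simultaneous equations (the so-called normal equations):

$$\frac{\partial E}{\partial m} = -2\sum_{i=1}^{n} x_i \left(y_i - (mx_i + c)\right) = 0, \qquad \frac{\partial E}{\partial c} = -2\sum_{i=1}^{n} \left(y_i - (mx_i + c)\right) = 0$$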

The formulas for $m$ (slope) and $c$ (intercept) are derived from these equations:

  1. Slope ($m$):

     $$m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

     where $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$ respectively.

Here, $m$ (slope) is the correlation between the two variables multiplied by the standard deviation of $Y$ divided by the standard deviation of $X$. An equivalent expression for the slope, using the correlation coefficient ($r$) and the standard deviations of $X$ ($\sigma_x$) and $Y$ ($\sigma_y$), is:

$$m = r \frac{\sigma_y}{\sigma_x}$$

  2. Intercept ($c$): $c = \bar{y} - m\bar{x}$

$c$ (intercept) can be calculated as the mean of $Y$ minus the product of the slope and the mean of $X$. With these equations, you can compute the coefficients $m$ and $c$ for any given dataset, as in the sketch below. Also, when you hear about maximum likelihood estimation in this context, know that it describes the same fit: if the errors are assumed to be normally distributed, maximizing the likelihood yields exactly the OLS line.
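
For illustration, here's a minimal NumPy sketch of those two formulas (the toy arrays are made up for the example):

import numpy as np

# Toy data with a roughly linear trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.7, 6.5, 7.5, 8.3, 9.2])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form OLS estimates, straight from the formulas above
m = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
c = y_bar - m * x_bar

print(m, c)  # 0.88 4.8 for this data

You can cross-check the result against np.polyfit(x, y, 1), which should return the same pair of coefficients.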

Gradient Descent

While OLS is the standard approach, gradient descent is an alternative method that becomes useful with higher-dimensional data or very large datasets. Instead of solving for the coefficients in one shot, it starts from an initial guess and repeatedly steps downhill along the slope of the error surface until the fit stops improving, as sketched below.
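
Here's a minimal sketch of that idea, fitting $m$ and $c$ on the same toy data by repeatedly stepping down the gradient of the mean squared error (the learning rate and iteration count are arbitrary choices for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.7, 6.5, 7.5, 8.3, 9.2])

m, c = 0.0, 0.0        # start from an arbitrary guess
learning_rate = 0.01

for _ in range(5000):
    y_hat = m * x + c
    # Partial derivatives of the mean squared error with respect to m and c
    grad_m = -2.0 * np.mean(x * (y - y_hat))
    grad_c = -2.0 * np.mean(y - y_hat)
    m -= learning_rate * grad_m
    c -= learning_rate * grad_c

print(m, c)  # converges toward the OLS solution (0.88, 4.8)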

Coefficient of Determination ($r^2$)

How do we gauge the effectiveness of our regression? Enter $r^2$, also known as the coefficient of determination. It measures how much of the total variation in $Y$ is captured by your model. The formula to compute it is:

$$r^2 = 1.0 - \frac{\text{sum of squared errors}}{\text{sum of squared variation from mean}}$$

An $r^2$ value close to 0 indicates a poor fit, while a value close to 1 signifies an excellent fit.
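
That formula translates directly into a few lines of NumPy; here's a sketch reusing the toy data and the coefficients computed earlier:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.7, 6.5, 7.5, 8.3, 9.2])

m, c = 0.88, 4.8                        # coefficients from the OLS sketch above
y_hat = m * x + c

ss_err = np.sum((y - y_hat) ** 2)       # sum of squared errors
ss_total = np.sum((y - y.mean()) ** 2)  # squared variation from the mean
r_squared = 1.0 - ss_err / ss_total
print(r_squared)                        # close to 1 here, since the data is nearly linear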

Implementing Linear Regression in Python

Want to see it in action? Let's create some random-ish linearly correlated data, plotting page rendering speeds against purchase amounts.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Page render times in seconds, normally distributed around 3.0
pageSpeeds = np.random.normal(3.0, 1.0, 1000)
# Purchase amounts fall linearly as pages get slower, plus a little noise
purchaseAmount = 100 - (pageSpeeds + np.random.normal(0, 0.1, 1000)) * 3

plt.scatter(pageSpeeds, purchaseAmount)
plt.show()

Generated Linear Data

With the generated data in place, we can use SciPy to determine the best fit line.

from scipy import stats

# linregress returns the slope, intercept, correlation (r), p-value, and standard error
slope, intercept, r_value, p_value, std_err = stats.linregress(pageSpeeds, purchaseAmount)
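
Since linregress also returns the correlation coefficient, squaring r_value gives the $r^2$ discussed above; for this nearly noise-free synthetic data it should come out very close to 1:

print(r_value ** 2)  # coefficient of determination; expect a value near 1.0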

Having obtained the slope and intercept, we can plot our data alongside the best fit line:

import matplotlib.pyplot as plt

def predict(x):
    # Evaluate the fitted line: y = slope * x + intercept
    return slope * x + intercept

fitLine = predict(pageSpeeds)
plt.scatter(pageSpeeds, purchaseAmount)
plt.plot(pageSpeeds, fitLine, c='r')  # best fit line in red
plt.show()

Best Fit Line

Remember, while linear regression is a potent tool, it has its constraints: it only captures straight-line relationships, and it assumes the errors are independent and roughly normally distributed with constant variance. It's essential to understand these assumptions and limits when applying it in practice.