
Linear Regression

Linear Regression Overview

Linear regression aims to fit a straight line to a set of observations, making it possible to predict new, unseen values. Essentially, it's about identifying a linear relationship between two variables. If we take a set of data points where Petal Width (on the $x$-axis) is plotted against Sepal Length (on the $y$-axis), a linear relationship might be evident.

Petal Width vs Sepal Length

The fundamental equation for a straight line is $y = mx + b$, where $m$ represents the slope and $b$ is the $y$-intercept. In our example, the computed slope is $0.89$ and the $y$-intercept is $4.78$.
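
For instance, plugging in a hypothetical petal width of 1.0 gives a predicted sepal length of $y = 0.89 \times 1.0 + 4.78 = 5.67$.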

Ordinary Least Squares (OLS)

OLS is the technique powering linear regression. It seeks to minimize the squared error between each data point and the regression line. Think of it as computing the variance, but relative to the line we're trying to define rather than the mean.

OLS Example

The equation for a straight line is $y = mx + c$. Given:

  1. $x_1, x_2, \ldots, x_n$ are the observed values of the independent variable.
  2. $y_1, y_2, \ldots, y_n$ are the observed values of the dependent variable.
  3. $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n$ are the predicted values of the dependent variable.
  4. The linear regression line is represented as $y = mx + c$.

The goal of OLS is to minimize the sum of the squared differences (errors) between the observed values ($y_i$) and the predicted values ($\hat{y}_i$). Mathematically, this is represented as:

$$E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Given the equation of the line:

$$\hat{y}_i = mx_i + c$$

The error term can be written as:

$$E = \sum_{i=1}^{n} \left(y_i - (mx_i + c)\right)^2$$

To find the best values for $m$ and $c$ that minimize the error $E$, we take the partial derivatives of $E$ with respect to $m$ and $c$ and set them equal to zero.
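
Setting those partial derivatives to zero gives a pair of simultaneous equations (the so-called normal equations):

$$\frac{\partial E}{\partial m} = -2\sum_{i=1}^{n} x_i \left(y_i - (mx_i + c)\right) = 0, \qquad \frac{\partial E}{\partial c} = -2\sum_{i=1}^{n} \left(y_i - (mx_i + c)\right) = 0$$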

The formulas for $m$ (slope) and $c$ (intercept) are derived from these equations:

  1. Slope ($m$):

     $$m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

     where $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$ respectively.

Here, $m$ (slope) is the correlation between the two variables multiplied by the standard deviation of $Y$ divided by the standard deviation of $X$. An equivalent expression for the slope, using the correlation coefficient ($r$) and the standard deviations of $X$ ($\sigma_x$) and $Y$ ($\sigma_y$), is:

$$m = r \frac{\sigma_y}{\sigma_x}$$

  2. Intercept ($c$): $c = \bar{y} - m\bar{x}$

$c$ (intercept) can be calculated as the mean of $Y$ minus the product of the slope and the mean of $X$. With these equations, you can compute the coefficients $m$ and $c$ for any given dataset, as in the sketch below. Also, when you hear about maximum likelihood estimation in this context, know that it describes the same fit: if the errors are assumed to be normally distributed, maximizing the likelihood yields exactly the OLS line.
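
For illustration, here's a minimal NumPy sketch of those two formulas (the toy arrays are made up for the example):

import numpy as np

# Toy data with a roughly linear trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.7, 6.5, 7.5, 8.3, 9.2])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form OLS estimates, straight from the formulas above
m = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
c = y_bar - m * x_bar

print(m, c)  # 0.88 4.8 for this data

You can cross-check the result against np.polyfit(x, y, 1), which should return the same pair of coefficients.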

Gradient Descent

While OLS is the standard approach, gradient descent is an alternative method that becomes useful with higher-dimensional data or very large datasets. Instead of solving for the coefficients in one shot, it starts from an initial guess and repeatedly steps downhill along the slope of the error surface until the fit stops improving, as sketched below.
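
Here's a minimal sketch of that idea, fitting $m$ and $c$ on the same toy data by repeatedly stepping down the gradient of the mean squared error (the learning rate and iteration count are arbitrary choices for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.7, 6.5, 7.5, 8.3, 9.2])

m, c = 0.0, 0.0        # start from an arbitrary guess
learning_rate = 0.01

for _ in range(5000):
    y_hat = m * x + c
    # Partial derivatives of the mean squared error with respect to m and c
    grad_m = -2.0 * np.mean(x * (y - y_hat))
    grad_c = -2.0 * np.mean(y - y_hat)
    m -= learning_rate * grad_m
    c -= learning_rate * grad_c

print(m, c)  # converges toward the OLS solution (0.88, 4.8)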

Coefficient of Determination ($r^2$)

How do we gauge the effectiveness of our regression? Enter $r^2$, also known as the coefficient of determination. It measures how much of the total variation in $Y$ is captured by your model. The formula to compute it is:

$$r^2 = 1.0 - \frac{\text{sum of squared errors}}{\text{sum of squared variation from mean}}$$

An $r^2$ value close to 0 indicates a poor fit, while a value close to 1 signifies an excellent fit.
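
That formula translates directly into a few lines of NumPy; here's a sketch reusing the toy data and the coefficients computed earlier:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.7, 6.5, 7.5, 8.3, 9.2])

m, c = 0.88, 4.8                        # coefficients from the OLS sketch above
y_hat = m * x + c

ss_err = np.sum((y - y_hat) ** 2)       # sum of squared errors
ss_total = np.sum((y - y.mean()) ** 2)  # squared variation from the mean
r_squared = 1.0 - ss_err / ss_total
print(r_squared)                        # close to 1 here, since the data is nearly linear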

Implementing Linear Regression in Python

Want to see it in action? Let's create some random-ish linearly correlated data, plotting page rendering speeds against purchase amounts.

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Page render times in seconds, normally distributed around 3.0
pageSpeeds = np.random.normal(3.0, 1.0, 1000)
# Purchase amounts fall linearly as pages get slower, plus a little noise
purchaseAmount = 100 - (pageSpeeds + np.random.normal(0, 0.1, 1000)) * 3

plt.scatter(pageSpeeds, purchaseAmount)
plt.show()

Generated Linear Data

With the generated data in place, we can use SciPy to determine the best fit line.

from scipy import stats

# linregress returns the slope, intercept, correlation (r), p-value, and standard error
slope, intercept, r_value, p_value, std_err = stats.linregress(pageSpeeds, purchaseAmount)
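
Since linregress also returns the correlation coefficient, squaring r_value gives the $r^2$ discussed above; for this nearly noise-free synthetic data it should come out very close to 1:

print(r_value ** 2)  # coefficient of determination; expect a value near 1.0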

Having obtained the slope and intercept, we can plot our data alongside the best fit line:

import matplotlib.pyplot as plt

def predict(x):
    # Evaluate the fitted line: y = slope * x + intercept
    return slope * x + intercept

fitLine = predict(pageSpeeds)
plt.scatter(pageSpeeds, purchaseAmount)
plt.plot(pageSpeeds, fitLine, c='r')  # best fit line in red
plt.show()

Best Fit Line

Remember, while linear regression is a potent tool, it has its constraints: it only captures straight-line relationships, and it assumes the errors are independent and roughly normally distributed with constant variance. It's essential to understand these assumptions and limits when applying it in practice.