Linear Regression
Linear Regression Overview
Linear regression aims to fit a straight line to a set of observations, making it possible to predict new, unseen values. Essentially, it's about identifying a linear relationship between two variables. If we take a set of data points where Petal Width (on the $y$-axis) is plotted against Sepal Length (on the $x$-axis), a linear relationship might be evident.
The fundamental equation for a straight line is $y = mx + b$, where $m$ represents the slope and $b$ is the $y$-intercept. In our example, the slope and $y$-intercept are computed directly from the observed data, using the methods described below.
Ordinary Least Squares (OLS)
OLS is the technique powering linear regression. It seeks to minimize the squared error between each data point and the regression line. Think of it as computing the variance, but relative to the line we're trying to define rather than the mean.
The equation for a straight line is $y = mx + b$. Given:
- $x_i$ are the observed values of the independent variable.
- $y_i$ are the observed values of the dependent variable.
- $\hat{y}_i$ are the predicted values of the dependent variable.
- The linear regression line is represented as $\hat{y}_i = m x_i + b$.
The goal of OLS is to minimize the sum of the squared differences (errors) between the observed values ($y_i$) and the predicted values ($\hat{y}_i$). Mathematically, this is represented as:
$$E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Given the equation of the line:
$$\hat{y}_i = m x_i + b$$
The error term can be written as:
$$E = \sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2$$
To find the best values for $m$ and $b$ that minimize the error $E$, we take the partial derivatives of $E$ with respect to $m$ and $b$ and set them equal to zero. The formulas for $m$ (slope) and $b$ (intercept) are derived from these equations:
- Slope ($m$):
$$m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
where $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$ respectively. Here, $m$ (slope) can equivalently be computed as the correlation between the variables multiplied by the standard deviation of $y$ divided by the standard deviation of $x$. Using the correlation coefficient ($r$) and the standard deviations of $x$ ($s_x$) and $y$ ($s_y$):
$$m = r \frac{s_y}{s_x}$$
- Intercept ($b$):
$$b = \bar{y} - m\bar{x}$$
That is, $b$ (intercept) is the mean of $y$ minus the product of the slope and the mean of $x$. With these equations, you can compute the coefficients $m$ and $b$ for any given dataset. Also, when you hear about maximum likelihood estimation in this context, know that it's just a fancier way of describing the same fit: assuming normally distributed errors, maximum likelihood estimation yields exactly the OLS line.
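To make these formulas concrete, here is a minimal sketch of computing $m$ and $b$ with NumPy; the data points are arbitrary toy values chosen to lie near the line $y = 2x + 1$:

import numpy as np

# Toy data lying roughly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

x_mean, y_mean = x.mean(), y.mean()

# Slope: sum of co-deviations over sum of squared x deviations
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
# Intercept: mean of y minus slope times mean of x
b = y_mean - m * x_mean

print(m, b)  # roughly 2.01 and 1.03 for this toy data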
Gradient Descent
While OLS is commonly used, gradient descent is an alternative method, especially useful for higher-dimensional data (for example, fitting a plane in three dimensions) or datasets too large to solve in one shot. Rather than computing the coefficients directly, it starts from an initial guess and repeatedly adjusts the slope and intercept, following the gradient of the error surface downhill until the error stops shrinking.
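Here is a minimal sketch of that iterative idea applied to the same line fit; the learning rate, iteration count, and toy data are arbitrary illustrative choices:

import numpy as np

def gradient_descent_fit(x, y, learning_rate=0.01, iterations=10000):
    # Fit y = m*x + b by repeatedly stepping downhill on the mean squared error
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(iterations):
        error = (m * x + b) - y
        # Partial derivatives of the mean squared error with respect to m and b
        grad_m = (2.0 / n) * np.sum(error * x)
        grad_b = (2.0 / n) * np.sum(error)
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])
print(gradient_descent_fit(x, y))  # converges near the OLS answer

In practice you rarely hand-roll this loop for simple linear regression, since OLS solves it directly, but the same approach generalizes to models with no closed-form solution.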
Coefficient of Determination ($r^2$)
How do we gauge the effectiveness of our regression? Enter $r^2$, also known as the coefficient of determination. It measures how much of the total variation in $y$ is captured by your model. The formula to compute it is:
$$r^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
where a value close to 0 indicates a poor fit and a value close to 1 signifies an excellent fit.
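As a sketch, $r^2$ falls out of a few lines of NumPy; the data and coefficients below are the toy values from the OLS sketch above:

import numpy as np

def r_squared(y, y_pred):
    ss_residual = np.sum((y - y_pred) ** 2)   # squared error left over after the fit
    ss_total = np.sum((y - np.mean(y)) ** 2)  # total squared variation around the mean
    return 1.0 - ss_residual / ss_total

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])
print(r_squared(y, 2.01 * x + 1.03))  # about 0.999: an excellent fit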
Implementing Linear Regression in Python
Want to see it in action? Let's generate some roughly linear, randomly perturbed data, relating page rendering speeds to purchase amounts.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Page render times: normally distributed around 3 seconds
pageSpeeds = np.random.normal(3.0, 1.0, 1000)
# Purchase amounts: negatively correlated with page speed, plus a little noise
purchaseAmount = 100 - (pageSpeeds + np.random.normal(0, 0.1, 1000)) * 3

plt.scatter(pageSpeeds, purchaseAmount)
plt.show()
With the generated data in place, we can use SciPy to determine the best-fit line.
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(pageSpeeds, purchaseAmount)
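Because linregress also returns the correlation coefficient, squaring it gives us the coefficient of determination directly, which serves as a quick sanity check on the fit:

print(r_value ** 2)  # r-squared: very close to 1 for this strongly linear data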
Having obtained the slope and intercept, we can plot our data alongside the best fit line:
import matplotlib.pyplot as plt

def predict(x):
    return slope * x + intercept

# Evaluate the best-fit line at every observed page speed
fitLine = predict(pageSpeeds)

plt.scatter(pageSpeeds, purchaseAmount)
plt.plot(pageSpeeds, fitLine, c='r')  # regression line drawn in red
plt.show()
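With the fitted model in hand, predicting a purchase amount for a new, unseen page speed is just a call to predict (the 2.0 below is an arbitrary example input):

print(predict(2.0))  # estimated purchase amount for a 2-second page load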
Remember, while linear regression is a potent tool, it does have its constraints: it assumes the relationship between the variables really is linear, and outliers or non-linear patterns can badly skew the fit. It's essential to understand these assumptions and limits when applying it in practice.