
Polynomial Regression

Polynomial regression extends the concept of linear regression to fit data using higher-order polynomials. While linear regression models relationships with straight lines, polynomial regression can capture more complex, curved relationships.

  • First-Order Polynomial: y = mx + b
  • Second-Order Polynomial: y = ax^2 + bx + c
  • Third-Order Polynomial: y = ax^3 + bx^2 + cx + d

As the order increases, so does the complexity of the curves that can be represented. However, increasing the polynomial's degree doesn't always yield better results. Overcomplicating the model may result in overfitting, where the model performs exceptionally well on the training data but poorly on new, unseen data. Overfit models may appear to accommodate every outlier, which reduces their predictive accuracy on new data points. A common metric for goodness of fit is r^2, but a high r^2 doesn't always imply a good predictor.
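To see how quickly extra degrees of freedom can "accommodate every outlier", note that a degree-(n-1) polynomial can pass exactly through any n points. A small sketch with NumPy (illustrative data, pure noise by construction):

```python
import numpy as np

# A degree-5 polynomial has 6 coefficients, so it can pass exactly
# through 6 points -- even points that are nothing but noise.
rng = np.random.default_rng(1)
x = np.arange(6.0)
y = rng.normal(0.0, 1.0, 6)          # pure noise, no real relationship

p = np.poly1d(np.polyfit(x, y, 5))   # degree 5 through 6 points
max_residual = np.max(np.abs(p(x) - y))
print(max_residual)                  # essentially zero: a "perfect" fit to noise
```

The fit is perfect on these points yet carries no predictive value, which is exactly the failure mode a high training r^2 can hide.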

Implementation with NumPy

NumPy's polyfit function provides an easy way to implement polynomial regression. The following example generates synthetic data with a relationship between pageSpeeds and a derived purchaseAmount:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2)
# Page load times centered around 3 seconds
pageSpeeds = np.random.normal(3.0, 1.0, 1000)
# Purchase amounts that fall off as pages get slower
purchaseAmount = np.random.normal(50.0, 10.0, 1000) / pageSpeeds
plt.scatter(pageSpeeds, purchaseAmount)
plt.show()

The generated scatter plot shows a nonlinear relationship, suggesting a polynomial fit could be appropriate. Utilizing NumPy's polyfit() function, we can obtain a fourth-degree polynomial fit:

x = np.array(pageSpeeds)
y = np.array(purchaseAmount)
p4 = np.poly1d(np.polyfit(x, y, 4))
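The object returned by np.poly1d is callable like a function, and its c attribute holds the fitted coefficients, highest power first. A quick self-contained check on an exact quadratic (toy data, separate from the example above):

```python
import numpy as np

# Toy data sampled from an exact quadratic, so the fit should recover it
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = x ** 2 - 3 * x + 2

p = np.poly1d(np.polyfit(x, y, 2))

print(p.c)       # fitted coefficients, highest degree first: ~[1, -3, 2]
print(p(5.0))    # evaluate the polynomial at a new point: ~12.0
```

Because p4 works the same way, p4(xp) in the plotting code below simply evaluates the fitted fourth-degree polynomial at every point in xp.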

Plotting the original points against the predicted values:

import matplotlib.pyplot as plt
 
xp = np.linspace(0, 7, 100)
plt.scatter(x, y)
plt.plot(xp, p4(xp), c='r')
plt.show()

At this point, it looks like a reasonably good fit. What you want to ask yourself, though, is: "Am I overfitting? Does my curve look like it's actually going out of its way to accommodate outliers?"
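One way to probe that question is to hold out some of the data and compare r^2 on points the fit never saw. A minimal sketch, using its own synthetic data rather than the example above (the r2 helper and the every-other-point split are illustrative choices):

```python
import numpy as np

def r2(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(42)
x = np.sort(rng.uniform(0.5, 7.0, 60))
y = 50.0 / x + rng.normal(0.0, 3.0, 60)   # noisy curved relationship

# Every other point goes to a held-out test set
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

results = {}
for degree in (1, 4, 12):
    p = np.poly1d(np.polyfit(x_train, y_train, degree))
    results[degree] = (r2(y_train, p(x_train)), r2(y_test, p(x_test)))
    print(degree, results[degree])
```

Training r^2 can only improve as the degree grows, since a higher-degree fit contains every lower-degree fit as a special case; if the test r^2 stalls or drops while the training r^2 keeps climbing, the extra flexibility is modeling noise.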

Computing the r^2 Score

The r^2 score evaluates the fit quality. It can be computed using the r2_score() function from sklearn.metrics:

from sklearn.metrics import r2_score
r2 = r2_score(y, p4(x))
print(r2)

For this example, the r^2 score is approximately 0.829, which indicates a relatively good fit to the data.
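Under the hood, r2_score implements r^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean. A quick hand computation on toy numbers (illustrative data, not the example above) makes the formula concrete:

```python
import numpy as np

# r^2 by hand: 1 - (sum of squared residuals) / (total sum of squares)
y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([2.1, 3.9, 6.2, 7.8])

ss_res = np.sum((y_true - y_pred) ** 2)          # 0.10
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2) # 20.0
r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual)                                 # 0.995
```

A perfect fit gives r^2 = 1, predicting the mean everywhere gives 0, and a model worse than the mean can go negative, which is why a high r^2 on training data alone is never the whole story.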