Polynomial Regression
Polynomial regression extends the concept of linear regression to fit data using higher-order polynomials. While linear regression models relationships with straight lines, polynomial regression can capture more complex, curved relationships.
- First-Order Polynomial: y = ax + b
- Second-Order Polynomial: y = ax² + bx + c
- Third-Order Polynomial: y = ax³ + bx² + cx + d
As the order increases, so does the complexity of the curves that can be represented. However, increasing the polynomial's degree doesn't always yield better results. Overcomplicating the model may result in overfitting, where the model performs exceptionally well on the training data but poorly on new, unseen data. Overfit models may appear to accommodate every outlier, potentially reducing their predictive accuracy for new data points. A common metric for measuring goodness of fit is R² (r-squared), but a high R² doesn't always imply a good predictor.
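To make that warning concrete, here is a small sketch (using hypothetical synthetic data, not the page-speed example that follows) showing that R² measured on the training data itself can only rise as the degree grows, so a high training R² alone never proves the model generalizes:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical noisy data, purely for illustration
np.random.seed(0)
x = np.random.normal(3.0, 1.0, 100)
y = np.random.normal(50.0, 10.0, 100) / x

# A higher-degree polynomial can always match the training data at least
# as well as a lower-degree one, even when the extra flexibility is
# just chasing noise.
for degree in (1, 2, 6):
    p = np.poly1d(np.polyfit(x, y, degree))
    print(degree, r2_score(y, p(x)))
```

The printed training R² is non-decreasing in the degree; judging the fit on held-out data is what reveals whether the added complexity actually helps.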
Implementation with NumPy
NumPy's polyfit function provides an easy way to implement polynomial regression. The following example creates a relationship between pageSpeeds and a derived purchaseAmount:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2)

# Page load times, and purchase amounts that fall off as load time grows
pageSpeeds = np.random.normal(3.0, 1.0, 1000)
purchaseAmount = np.random.normal(50.0, 10.0, 1000) / pageSpeeds

plt.scatter(pageSpeeds, purchaseAmount)
The generated scatter plot shows a nonlinear relationship, suggesting a polynomial fit could be appropriate. Using NumPy's polyfit() function, we can obtain a fourth-degree polynomial fit:
x = np.array(pageSpeeds)
y = np.array(purchaseAmount)

# Fit a degree-4 polynomial and wrap the coefficients in a callable object
p4 = np.poly1d(np.polyfit(x, y, 4))
Plotting the original points against the predicted values:
import matplotlib.pyplot as plt

xp = np.linspace(0, 7, 100)  # evenly spaced points spanning the page-speed range
plt.scatter(x, y)
plt.plot(xp, p4(xp), c='r')  # fitted curve in red
plt.show()
At this point, it looks like a reasonably good fit. What you want to ask yourself, though, is: "Am I overfitting? Does my curve look like it's actually going out of its way to accommodate outliers?"
Computing the Error
The R² score evaluates the fit quality. It can be computed using the r2_score() function from sklearn.metrics:
from sklearn.metrics import r2_score
r2 = r2_score(y, p4(x))
print(r2)
For this example, the R² score is approximately 0.829, which indicates a relatively good fit to the data.
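A high R² computed on the same points the curve was fit to doesn't settle the overfitting question raised above. One way to probe it, sketched here with the same synthetic data and an arbitrary 80/20 split (both are assumptions, not part of the original example), is to fit on one portion of the data and score on the held-out remainder:

```python
import numpy as np
from sklearn.metrics import r2_score

np.random.seed(2)
pageSpeeds = np.random.normal(3.0, 1.0, 1000)
purchaseAmount = np.random.normal(50.0, 10.0, 1000) / pageSpeeds

# Hold out the last 20% of points; fit only on the rest
trainX, testX = pageSpeeds[:800], pageSpeeds[800:]
trainY, testY = purchaseAmount[:800], purchaseAmount[800:]

p4 = np.poly1d(np.polyfit(trainX, trainY, 4))

print("train R^2:", r2_score(trainY, p4(trainX)))
print("test  R^2:", r2_score(testY, p4(testX)))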