Multivariate Regression
Introduction to Multivariate Regression
Multivariate regression predicts a value from multiple attributes at once. It extends linear regression by considering several factors that influence the outcome. For instance, predicting the price of a car might depend on attributes like mileage, brand, and the age of the car. In multivariate regression, each feature has a coefficient that indicates how strongly it drives the prediction. A price model for a car might look like: price = α + β₁ · mileage + β₂ · age, where α is a constant (the intercept) and each β weights its feature.
The coefficients produced by a least-squares fit indicate how much each feature contributes to the prediction; provided the features are on comparable scales (e.g. normalized), a smaller coefficient implies the associated feature has less impact. Simpler models are preferred: features that add complexity without adding predictive value should be dropped. It is also important that the factors used in multivariate regression are not dependent on each other; this independence assumption does not always hold in practice.
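Under the hood, a multivariate fit is just least squares over a design matrix with one column per feature. The sketch below uses synthetic data (the coefficients and variable names are made up for illustration) and recovers the known coefficients with NumPy:

```python
import numpy as np

# Synthetic car data: price falls with mileage and age (made-up "true" model).
rng = np.random.default_rng(42)
n = 500
mileage = rng.uniform(0, 100_000, n)
age = rng.uniform(0, 15, n)
price = 30_000 - 0.15 * mileage - 800 * age + rng.normal(0, 500, n)

# Design matrix: a constant column (intercept) plus one column per feature.
X = np.column_stack([np.ones(n), mileage, age])

# Ordinary least squares: find coefficients minimizing squared error.
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
print(coef)  # ≈ [30000, -0.15, -800]
```

With enough data relative to the noise, the recovered coefficients closely match the ones the data was generated from.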
Python Implementation of Multivariate Regression
The statsmodels package in Python offers an easy way to perform multivariate regression. As an example, we predict car values using data from the Kelley Blue Book. The process involves:
1. Importing and examining the dataset using pandas:
import pandas as pd

# Load the Kelley Blue Book car data from the course's hosted Excel file
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
df.head()
| | Price | Mileage | Make | Model | Trim | Type | Cylinder | Liter | Doors | Cruise | Sound | Leather |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17314.103129 | 8221 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 1 |
| 1 | 17542.036083 | 9135 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 0 |
| 2 | 16218.847862 | 13196 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 1 | 0 |
| 3 | 16336.913140 | 16342 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 0 | 0 |
| 4 | 16339.170324 | 19832 | Buick | Century | Sedan 4D | Sedan | 6 | 3.1 | 4 | 1 | 0 | 1 |
2. Converting textual data to numerical codes and setting up the features for the regression:
import statsmodels.api as sm

# Convert the textual Model column into ordinal category codes
df['Model_ord'] = pd.Categorical(df.Model).codes

# Features and target for the regression
X = df[['Mileage', 'Model_ord', 'Doors']]
y = df[['Price']]

# Add a constant column so the model includes an intercept, then fit OLS
X1 = sm.add_constant(X)
est = sm.OLS(y, X1).fit()
est.summary()
| Dep. Variable: | Price | R-squared: | 0.042 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.038 |
| Method: | Least Squares | F-statistic: | 11.57 |
| Date: | Thu, 21 Sep 2023 | Prob (F-statistic): | 1.98e-07 |
| Time: | 20:54:21 | Log-Likelihood: | -8519.1 |
| No. Observations: | 804 | AIC: | 1.705e+04 |
| Df Residuals: | 800 | BIC: | 1.706e+04 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 3.125e+04 | 1809.549 | 17.272 | 0.000 | 2.77e+04 | 3.48e+04 |
| Mileage | -0.1765 | 0.042 | -4.227 | 0.000 | -0.259 | -0.095 |
| Model_ord | -39.0387 | 39.326 | -0.993 | 0.321 | -116.234 | 38.157 |
| Doors | -1652.9303 | 402.649 | -4.105 | 0.000 | -2443.303 | -862.558 |
| Omnibus: | 206.410 | Durbin-Watson: | 0.080 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 470.872 |
| Skew: | 1.379 | Prob(JB): | 5.64e-103 |
| Kurtosis: | 5.541 | Cond. No. | 1.15e+05 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.15e+05. This might indicate that there are strong multicollinearity or other numerical problems.
3. Analyzing the model summary. Key insights include:
- The R-squared value, i.e. the fraction of the variance in price that the model explains (only 0.042 here, so this fit is weak).
- The coefficient of each feature.
- The standard errors and p-values, which indicate whether each attribute is statistically significant.
4. Additional analyses can be conducted to better understand the impact of specific features. For instance, examining the mean price for the given number of doors:
# Mean price, grouped by the number of doors
y.groupby(df.Doors).mean()
| Doors | Price |
|---|---|
| 2 | 23807.135520 |
| 4 | 20580.670749 |
This shows a surprising negative relationship between the number of doors and the price: two-door cars average a higher price than four-door cars. However, this is a small dataset, and the result may not be statistically significant.
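One possible explanation, sketched below with made-up numbers, is confounding: if the two-door cars in a sample skew toward pricier body styles (coupes, convertibles), the raw group means will show two-door cars as more expensive even though the door count itself adds no value:

```python
import pandas as pd

# Made-up toy sample: the two-door cars happen to be sportier body styles.
toy = pd.DataFrame({
    'Doors': [2, 2, 2, 4, 4, 4],
    'Type':  ['Coupe', 'Convertible', 'Coupe', 'Sedan', 'Sedan', 'Sedan'],
    'Price': [28_000, 35_000, 26_000, 21_000, 19_000, 22_000],
})

# Grouping by doors alone makes two-door cars look more expensive,
# although body style, not door count, drives the difference here.
means = toy.groupby('Doors')['Price'].mean()
print(means)
```

Controlling for such confounders (e.g. by including body style as a feature) is one way to separate the effect of doors from the effect of the car's type.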