Artificial Intelligence 🤖
Predictive Models
Multivariate regression

Multivariate Regression

Introduction to Multivariate Regression

Multivariate regression allows for the prediction of a value based on multiple attributes. It advances the idea beyond linear regression by considering multiple factors that influence the outcome. For instance, predicting the price of a car might depend on several attributes like mileage, brand, and the age of the car. In multivariate regression, each feature has a coefficient that indicates its importance in predicting the output. An example of a price model for a car might look like:

\text{price} = \alpha + \beta_1\,\text{mileage} + \beta_2\,\text{age} + \beta_3\,\text{doors}

Coefficients derived from the least squares fit can be used to understand the significance of each feature in the prediction model: a smaller coefficient implies the associated feature has less impact (a comparison that is only meaningful when the features are on similar scales). Simplicity in models is preferred; unnecessary complexity rarely adds value and should be avoided. It is also important that the factors used in multivariate regression are not dependent on each other, although this independence assumption does not always hold in practice.
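The independence assumption can be checked empirically with a correlation matrix before fitting. A minimal sketch on synthetic data (not the real car dataset; the numbers are invented so that mileage and age are deliberately correlated, to show what a violation looks like):

```python
import numpy as np
import pandas as pd

# Hypothetical data: mileage grows with age, so the two are correlated.
rng = np.random.default_rng(42)
age = rng.uniform(1, 10, 200)
mileage = 12000 * age + rng.normal(0, 5000, 200)
doors = rng.choice([2, 4], 200)

features = pd.DataFrame({'mileage': mileage, 'age': age, 'doors': doors})
# A pairwise correlation near +/-1 between two predictors is a red flag.
print(features.corr().round(2))
```

Here mileage and age would show a correlation close to 1, so including both in the same regression would violate the independence assumption.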

Python Implementation of Multivariate Regression

The statsmodels package in Python offers an easy way to perform multivariate regression. As an example, we can predict car values using data from the Kelley Blue Book. The process involves:

1. Importing and examining the dataset using pandas:

import pandas as pd
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
df.head()
   Price         Mileage  Make   Model    Trim      Type   Cylinder  Liter  Doors  Cruise  Sound  Leather
0  17314.103129  8221     Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       1      1
1  17542.036083  9135     Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       1      0
2  16218.847862  13196    Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       1      0
3  16336.913140  16342    Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       0      0
4  16339.170324  19832    Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       0      1

2. Converting textual data to numerical codes and setting up the features for the regression:

import statsmodels.api as sm

# Convert the text-based Model column into ordinal numeric codes
df['Model_ord'] = pd.Categorical(df.Model).codes
X = df[['Mileage', 'Model_ord', 'Doors']]
y = df[['Price']]

# Add a constant column for the intercept term, then fit ordinary least squares
X1 = sm.add_constant(X)
est = sm.OLS(y, X1).fit()

est.summary()
Dep. Variable:       Price             R-squared:           0.042
Model:               OLS               Adj. R-squared:      0.038
Method:              Least Squares     F-statistic:         11.57
Date:                Thu, 21 Sep 2023  Prob (F-statistic):  1.98e-07
Time:                20:54:21          Log-Likelihood:      -8519.1
No. Observations:    804               AIC:                 1.705e+04
Df Residuals:        800               BIC:                 1.706e+04
Df Model:            3
Covariance Type:     nonrobust

                 coef    std err        t    P>|t|     [0.025     0.975]
const       3.125e+04   1809.549   17.272    0.000   2.77e+04   3.48e+04
Mileage       -0.1765      0.042   -4.227    0.000     -0.259     -0.095
Model_ord    -39.0387     39.326   -0.993    0.321   -116.234     38.157
Doors      -1652.9303    402.649   -4.105    0.000  -2443.303   -862.558

Omnibus:          206.410   Durbin-Watson:      0.080
Prob(Omnibus):      0.000   Jarque-Bera (JB):   470.872
Skew:               1.379   Prob(JB):           5.64e-103
Kurtosis:           5.541   Cond. No.           1.15e+05

Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.15e+05. This might indicate that there are strong multicollinearity or other numerical problems.

3. Analyzing the model summary. Key insights include:

  • R-squared value.
  • Coefficients of each feature.
  • Standard errors to determine the significance of each attribute.

4. Additional analyses can be conducted to better understand the impact of specific features. For instance, examining the mean price grouped by number of doors:

y.groupby(df.Doors).mean()

              Price
Doors
2      23807.135520
4      20580.670749

This shows a perhaps surprising negative relationship between the number of doors and the price: four-door cars average several thousand dollars less than two-door cars here. It is a small dataset, however, so the result might not be statistically significant.
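Whether a gap of that size could plausibly arise by chance can be checked with a two-sample t-test. A hedged sketch using synthetic prices with roughly the group means seen above (the sample sizes and spread are assumptions, not the real data):

```python
import numpy as np
from scipy import stats

# Hypothetical price samples standing in for the two door groups
rng = np.random.default_rng(1)
price_2_door = rng.normal(23800, 9000, 200)
price_4_door = rng.normal(20600, 9000, 600)

# Welch's t-test: is the mean-price gap larger than chance would explain?
t_stat, p_value = stats.ttest_ind(price_2_door, price_4_door, equal_var=False)
print(t_stat, p_value)
```

A small p-value would suggest the door-count effect is real rather than noise; on the actual dataset the same call could be made on the two `Price` groups directly.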