Artificial Intelligence 🤖
Predictive Models
Multivariate regression

Multivariate Regression

Introduction to Multivariate Regression

Multivariate regression allows for the prediction of a value based on multiple attributes. It advances the idea beyond linear regression by considering multiple factors that influence the outcome. For instance, predicting the price of a car might depend on several attributes like mileage, brand, and the age of the car. In multivariate regression, each feature has a coefficient that indicates its importance in predicting the output. An example of a price model for a car might look like:

\text{price} = \alpha + \beta_1\,\text{mileage} + \beta_2\,\text{age} + \beta_3\,\text{doors}

Coefficients derived from the least squares fit can be used to understand the significance of each feature in the prediction model: a smaller coefficient implies the associated feature has less impact (a comparison that is only meaningful when the features are on similar scales). Simplicity in models is preferred; unnecessary complexity rarely adds value and should be avoided. It is also important that the factors used in multivariate regression are not dependent on each other, although this independence assumption does not always hold in practice.
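The independence assumption can be checked empirically with a correlation matrix before fitting. A minimal sketch on synthetic data (not the real car dataset; the numbers are invented so that mileage and age are deliberately correlated, to show what a violation looks like):

```python
import numpy as np
import pandas as pd

# Hypothetical data: mileage grows with age, so the two are correlated.
rng = np.random.default_rng(42)
age = rng.uniform(1, 10, 200)
mileage = 12000 * age + rng.normal(0, 5000, 200)
doors = rng.choice([2, 4], 200)

features = pd.DataFrame({'mileage': mileage, 'age': age, 'doors': doors})
# A pairwise correlation near +/-1 between two predictors is a red flag.
print(features.corr().round(2))
```

Here mileage and age would show a correlation close to 1, so including both in the same regression would violate the independence assumption.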

Python Implementation of Multivariate Regression

The statsmodels package in Python offers an easy way to perform multivariate regression. As an example, we can predict car values using data from the Kelley Blue Book. The process involves:

1. Importing and examining the dataset using pandas:

import pandas as pd
df = pd.read_excel('http://cdn.sundog-soft.com/Udemy/DataScience/cars.xls')
df.head()
   Price         Mileage  Make   Model    Trim      Type   Cylinder  Liter  Doors  Cruise  Sound  Leather
0  17314.103129  8221     Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       1      1
1  17542.036083  9135     Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       1      0
2  16218.847862  13196    Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       1      0
3  16336.913140  16342    Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       0      0
4  16339.170324  19832    Buick  Century  Sedan 4D  Sedan  6         3.1    4      1       0      1

2. Converting textual data to numerical codes and setting up the features for the regression:

import statsmodels.api as sm

# Convert the text-based Model column into ordinal numeric codes
df['Model_ord'] = pd.Categorical(df.Model).codes
X = df[['Mileage', 'Model_ord', 'Doors']]
y = df[['Price']]

# Add a constant column for the intercept term, then fit ordinary least squares
X1 = sm.add_constant(X)
est = sm.OLS(y, X1).fit()

est.summary()
Dep. Variable:       Price             R-squared:           0.042
Model:               OLS               Adj. R-squared:      0.038
Method:              Least Squares     F-statistic:         11.57
Date:                Thu, 21 Sep 2023  Prob (F-statistic):  1.98e-07
Time:                20:54:21          Log-Likelihood:      -8519.1
No. Observations:    804               AIC:                 1.705e+04
Df Residuals:        800               BIC:                 1.706e+04
Df Model:            3
Covariance Type:     nonrobust

                 coef    std err        t    P>|t|     [0.025     0.975]
const       3.125e+04   1809.549   17.272    0.000   2.77e+04   3.48e+04
Mileage       -0.1765      0.042   -4.227    0.000     -0.259     -0.095
Model_ord    -39.0387     39.326   -0.993    0.321   -116.234     38.157
Doors      -1652.9303    402.649   -4.105    0.000  -2443.303   -862.558

Omnibus:          206.410   Durbin-Watson:      0.080
Prob(Omnibus):      0.000   Jarque-Bera (JB):   470.872
Skew:               1.379   Prob(JB):           5.64e-103
Kurtosis:           5.541   Cond. No.           1.15e+05

Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.15e+05. This might indicate that there are strong multicollinearity or other numerical problems.

3. Analyzing the model summary. Key insights include:

  • R-squared value.
  • Coefficients of each feature.
  • Standard errors to determine the significance of each attribute.

4. Additional analyses can be conducted to better understand the impact of specific features. For instance, examining the mean price grouped by number of doors:

y.groupby(df.Doors).mean()

              Price
Doors
2      23807.135520
4      20580.670749

This shows a perhaps surprising negative relationship between the number of doors and the price: four-door cars average several thousand dollars less than two-door cars here. It is a small dataset, however, so the result might not be statistically significant.
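Whether a gap of that size could plausibly arise by chance can be checked with a two-sample t-test. A hedged sketch using synthetic prices with roughly the group means seen above (the sample sizes and spread are assumptions, not the real data):

```python
import numpy as np
from scipy import stats

# Hypothetical price samples standing in for the two door groups
rng = np.random.default_rng(1)
price_2_door = rng.normal(23800, 9000, 200)
price_4_door = rng.normal(20600, 9000, 600)

# Welch's t-test: is the mean-price gap larger than chance would explain?
t_stat, p_value = stats.ttest_ind(price_2_door, price_4_door, equal_var=False)
print(t_stat, p_value)
```

A small p-value would suggest the door-count effect is real rather than noise; on the actual dataset the same call could be made on the two `Price` groups directly.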