Filter Methods (Univariate Selection)
Filter methods use the relationship between each feature and the target variable to compute feature importance. Univariate feature selection works by selecting the best features based on univariate statistical tests, and it can be seen as a preprocessing step before fitting an estimator.
As the name suggests, this method filters out everything except a subset of the relevant features, and the model is built only after the selection. The filtering here is done using a correlation matrix, most commonly computed with the Pearson correlation.
Univariate test: each variable is evaluated in isolation, so the method does not consider multiple variables together or their possible interactions.
Statistical tests can be used to select those features that have the strongest relationship with the output variable. The scikit-learn library provides the SelectKBest class, which can be used with a suite of different statistical tests to select a specific number of features.
Many different statistical tests can be used with this selection method. For example, the ANOVA F-value method is appropriate for numerical input features and a categorical target.
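As a quick illustration of the SelectKBest API before the full examples later in this section, here is a minimal sketch on a small synthetic dataset (the make_classification data and the choice of k=3 are arbitrary assumptions for illustration):

# Minimal sketch of the SelectKBest API on synthetic data (illustrative only)
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X_demo, y_demo = make_classification(n_samples=100, n_features=6,
                                     n_informative=3, random_state=0)
selector = SelectKBest(score_func=f_classif, k=3)  # keep the 3 best features
X_new = selector.fit_transform(X_demo, y_demo)
print(selector.scores_)        # F-score for each original feature
print(selector.get_support())  # boolean mask of the selected features
print(X_new.shape)             # (100, 3)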
Pearson's Correlation Coefficient
Here we will first plot the Pearson correlation heatmap and look at the correlation of the independent variables with the output variable MEDV. We will only select features whose absolute correlation with the output variable is above 0.5.
The correlation coefficient takes values between -1 and 1 (a quick numeric illustration follows the list below):
- A value closer to 0 implies weaker correlation (exact 0 implying no correlation)
- A value closer to 1 implies stronger positive correlation
- A value closer to -1 implies stronger negative correlation
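For intuition, here is a small sketch with made-up arrays showing how the Pearson coefficient (via np.corrcoef) behaves in these three cases:

# Illustrative sketch: Pearson correlation for three toy relationships
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(100, dtype=float)
y_pos = 2 * x + rng.normal(scale=5, size=100)    # strong positive relationship
y_neg = -3 * x + rng.normal(scale=5, size=100)   # strong negative relationship
y_rand = rng.normal(size=100)                    # no relationship with x
print(np.corrcoef(x, y_pos)[0, 1])   # close to +1
print(np.corrcoef(x, y_neg)[0, 1])   # close to -1
print(np.corrcoef(x, y_rand)[0, 1])  # close to 0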
#importing libraries
# We will select features using the methods listed above for the regression problem of predicting the "MEDV" column.
%matplotlib inline
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso
#Loading the dataset
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df["MEDV"] = boston.target
X = df.drop("MEDV", axis=1) #Feature Matrix
y = df["MEDV"] #Target Variable
df.head()
|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
#Using Pearson Correlation
plt.figure(figsize=(12,10))
cor = df.corr(method='pearson')
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
#Correlation with output variable
cor_target = abs(cor["MEDV"])
#Selecting highly correlated features
relevant_features = cor_target[cor_target>0.5]
relevant_features
RM 0.695360
PTRATIO 0.507787
LSTAT 0.737663
MEDV 1.000000
Name: MEDV, dtype: float64
As we can see, only the features RM, PTRATIO and LSTAT are highly correlated with the output variable MEDV, so we will drop all the other features. However, this is not the end of the process. One of the assumptions of linear regression is that the independent variables should not be strongly correlated with each other (multicollinearity). If the selected features are correlated with each other, we need to keep only one of them and drop the rest. So let us check the correlation of the selected features with each other. This can be done either by visually inspecting the correlation matrix above or with the code snippet below.
print(df[["LSTAT","PTRATIO"]].corr())
print(df[["RM","LSTAT"]].corr())
LSTAT PTRATIO
LSTAT 1.000000 0.374044
PTRATIO 0.374044 1.000000
RM LSTAT
RM 1.000000 -0.613808
LSTAT -0.613808 1.000000
From the above output, the variables RM and LSTAT are fairly highly correlated with each other (-0.613808). Hence we keep only one of them: we keep LSTAT, since its correlation with MEDV is higher than that of RM. After dropping RM, we are left with two features, LSTAT and PTRATIO. These are the final features given by Pearson correlation.
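As a sanity check on this selection, here is a hedged sketch that fits a simple linear model on just these two features, reusing the LinearRegression and train_test_split imports from above (the test size and random state are arbitrary choices):

# Illustrative check: fit a linear model on the two selected features
X_sel = df[["LSTAT", "PTRATIO"]]
X_train, X_test, y_train, y_test = train_test_split(X_sel, y, test_size=0.3, random_state=42)
lr = LinearRegression()
lr.fit(X_train, y_train)
print("R^2 on the test set:", lr.score(X_test, y_test))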
Kendall Tau correlation coefficient
#Using Kendall Tau Correlation
plt.figure(figsize=(12,10))
cor = df.corr(method='kendall')
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
Spearman rank correlation
#Using Spearman Rank Correlation
plt.figure(figsize=(12,10))
cor = df.corr(method='spearman')
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
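By analogy with the Pearson step above, the same thresholding can be applied to the Spearman matrix just computed (the 0.5 cutoff is the same illustrative choice as before):

#Correlation with output variable, Spearman version
cor_target_spearman = abs(cor["MEDV"])
relevant_features_spearman = cor_target_spearman[cor_target_spearman > 0.5]
print(relevant_features_spearman)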
ANOVA F-value method
Scikit-learn provides F-test scoring functions for selecting the K best features: sklearn.feature_selection.f_regression for regression tasks and, for classification tasks, sklearn.feature_selection.f_classif.
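For the regression problem at hand, a minimal sketch of f_regression used with SelectKBest on the X and y defined earlier might look like this (k=5 is an arbitrary choice):

# Illustrative sketch: F-test based selection for the regression target MEDV
from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=5)
selector.fit(X, y)
print(X.columns[selector.get_support()])  # names of the top-scoring features
print(selector.scores_)                   # F-score for every feature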
There are some drawbacks to using the F-test to select your features. The F-test checks for, and only captures, linear relationships between features and labels: a strongly correlated feature is given a higher score and less correlated features are given lower scores.
Correlation can be highly deceptive because it does not capture strong non-linear relationships. Relying on summary statistics like correlation can therefore be a bad idea, as illustrated by Anscombe's quartet.
Francis Anscombe constructed four distinct datasets with the same mean, variance and correlation to emphasize that summary statistics do not completely describe a dataset and can be quite deceptive.
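To make Anscombe's point concrete, here is a small sketch that uses the copy of the quartet bundled with seaborn (assuming sns.load_dataset can fetch it) and compares the summary statistics of the four datasets:

# Illustrative sketch: near-identical summary statistics across Anscombe's four datasets
anscombe = sns.load_dataset("anscombe")  # columns: dataset, x, y
for name, group in anscombe.groupby("dataset"):
    r = group["x"].corr(group["y"])
    print(name, group["x"].mean(), group["y"].mean(), round(r, 3))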
# Feature Selection with Univariate Statistical Tests
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
#load data
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv', names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
test = SelectKBest(score_func=f_classif, k=4)
fit = test.fit(X, Y)
# summarize scores
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])
[ 39.67 213.162 3.257 4.304 13.281 71.772 23.871 46.141]
[[ 6. 148. 33.6 50. ]
[ 1. 85. 26.6 31. ]
[ 8. 183. 23.3 32. ]
[ 1. 89. 28.1 21. ]
[ 0. 137. 43.1 33. ]]
You can see the scores for each attribute and the 4 attributes chosen (those with the highest scores): specifically, the features with indexes 0 (preg), 1 (plas), 5 (mass), and 7 (age). Methods based on the F-test estimate the degree of linear dependency between two random variables. Mutual information methods, on the other hand, can capture any kind of statistical dependency, but being nonparametric they require more samples for accurate estimation.
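For comparison, a hedged sketch scoring the same Pima features with mutual_info_classif instead of the F-test (the estimator is randomized, so random_state is fixed here for repeatability):

# Illustrative sketch: mutual information scores for the same Pima features
from sklearn.feature_selection import mutual_info_classif

mi_scores = mutual_info_classif(X, Y, random_state=0)
for name, score in zip(names[:8], mi_scores):
    print(name, round(score, 3))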
Feature selection with sparse data: if you use sparse data (i.e. data represented as sparse matrices), chi2, mutual_info_regression and mutual_info_classif will deal with the data without making it dense.
Beware not to use a regression scoring function with a classification problem; you will get useless results.
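A minimal sketch of that sparse-friendly behaviour, using a toy scipy CSR matrix of non-negative counts (the matrix and labels are made up for illustration):

# Illustrative sketch: chi2 scoring on a sparse matrix, no densification needed
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_selection import SelectKBest, chi2

X_sparse = csr_matrix(np.array([[0, 2, 0, 3],
                                [1, 0, 0, 4],
                                [0, 1, 2, 0],
                                [2, 0, 1, 1]]))
y_toy = np.array([0, 1, 0, 1])
selector = SelectKBest(score_func=chi2, k=2).fit(X_sparse, y_toy)
print(selector.scores_)
print(selector.get_support())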
Chi-Square Test
from sklearn.feature_selection import chi2
# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# summarize scores
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])
[ 111.52 1411.887 17.605 53.108 2175.565 127.669 5.393 181.304]
[[148. 0. 33.6 50. ]
[ 85. 0. 26.6 31. ]
[183. 0. 23.3 32. ]
[ 89. 94. 28.1 21. ]
[137. 168. 43.1 33. ]]
Other Methods:
The same sklearn.feature_selection.SelectKBest class works with other scoring functions, and scikit-learn offers related univariate selectors (a short example using one of them follows this list):
- f_classif: ANOVA F-value between label/feature for classification tasks.
- mutual_info_classif: Mutual information for a discrete target.
- chi2: Chi-squared stats of non-negative features for classification tasks.
- f_regression: F-value between label/feature for regression tasks.
- mutual_info_regression: Mutual information for a continuous target.
- SelectPercentile: Select features based on a percentile of the highest scores.
- SelectFpr: Select features based on a false positive rate test.
- SelectFdr: Select features based on an estimated false discovery rate.
- SelectFwe: Select features based on family-wise error rate.
- GenericUnivariateSelect: Univariate feature selector with configurable mode.
- For regression: f_regression, mutual_info_regression
- For classification: chi2, f_classif, mutual_info_classif
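As a quick illustration of one of these alternatives, here is a hedged sketch using SelectPercentile with f_classif on the same Pima arrays (keeping the top 50% of features is an arbitrary choice):

# Illustrative sketch: keep the top 50% of features by ANOVA F-score
from sklearn.feature_selection import SelectPercentile, f_classif

selector = SelectPercentile(score_func=f_classif, percentile=50)
X_top = selector.fit_transform(X, Y)
print(selector.get_support())  # mask over the 8 Pima features
print(X_top.shape)             # half of the original features kept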