Embedded Methods

Embedded methods are iterative in the sense that they take care of each iteration of the model training process and extract the features that contribute most to the training in that iteration. Regularization methods are the most commonly used embedded methods; they penalize a feature when its coefficient falls below a given threshold.

Least Absolute Shrinkage and Selection Operator (LASSO)

L1-recovery and compressive sensing:

For a good choice of alpha, the Lasso can fully recover the exact set of non-zero variables using only a few observations, provided certain specific conditions are met. In particular, the number of samples should be 'sufficiently large', or L1 models will perform at random, where 'sufficiently large' depends on the number of non-zero coefficients, the logarithm of the number of features, the amount of noise, the smallest absolute value of the non-zero coefficients, and the structure of the design matrix X. In addition, the design matrix must display certain specific properties, such as not being too correlated.

There is no general rule to select an alpha parameter for recovery of non-zero coefficients. It can be set by cross-validation (LassoCV or LassoLarsCV), though this may lead to under-penalized models: including a small number of non-relevant variables is not detrimental to the prediction score. BIC (LassoLarsIC) tends, on the contrary, to set high values of alpha.

Reference: Richard G. Baraniuk, "Compressive Sensing", IEEE Signal Processing Magazine, July 2007.
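Of these, LassoCV is demonstrated later in this section. As a brief sketch of the BIC route, LassoLarsIC picks alpha by minimizing an information criterion instead of cross-validating (the dataset and the scaling step here are illustrative choices, reusing the diabetes data from the example below):

# choose alpha with the Bayesian information criterion instead of cross-validation
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoLarsIC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
bic_lasso = make_pipeline(StandardScaler(), LassoLarsIC(criterion='bic')).fit(X, y)
print('Alpha chosen by BIC:', bic_lasso[-1].alpha_)

The rest of this section tunes alpha with a cross-validated grid search instead.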

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Lasso
 
# import dataset
from sklearn.datasets import load_diabetes
X,y = load_diabetes(return_X_y=True)
features = load_diabetes()['feature_names']
 
# split dataset into train and test, and calculate on train only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
 
# build our model, optimize its hyperparameter and train it on the training dataset.
# Since our dataset needs to be scaled in advance, we can make use of the powerful Pipeline object in scikit-learn.
# Our pipeline is made by a StandardScaler and the Lasso object itself.
pipeline = Pipeline([('scaler',StandardScaler()), ('model',Lasso())])

For this example, we are going to test several values from 0.1 to 10 with a 0.1 step. For each value, we calculate the average mean squared error over a 5-fold cross-validation and select the value of α that minimizes this average metric. We can use the GridSearchCV object for this purpose.

# optimize the α hyperparameter of Lasso regression
search = GridSearchCV(pipeline,
                      {'model__alpha':np.arange(0.1,10,0.1)},
                      cv = 5, scoring="neg_mean_squared_error",verbose=3
                      )

We use neg_mean_squared_error because the grid search tries to maximize the performance metric, so we add a minus sign in order to minimize the mean squared error.

# fit grid search
search.fit(X_train,y_train)
print(f'The best value for α is: {search.best_params_}')
The best value for α is: {'model__alpha': 1.2000000000000002}
# get the values of the coefficients of Lasso regression.
coefficients = search.best_estimator_.named_steps['model'].coef_
# The importance of a feature is the absolute value of its coefficient, so:
importance = np.abs(coefficients)
importance
array([ 0.151,  9.005, 26.902, 18.049,  5.419,  0.   , 12.279,  0.   ,
       19.489,  0.   ])

As we can see, there are 3 features with 0 importance. Those features have been discarded by our model.

print('The features that survived the Lasso regression are:')
print(np.array(features)[importance > 0])
print('The discarded features are:')
print(np.array(features)[importance == 0])
The features that survived the Lasso regression are:
    ['age' 'sex' 'bmi' 'bp' 's1' 's3' 's5']

The discarded features are:
    ['s2' 's4' 's6']

Alternatively, we can select the alpha parameter for recovery of non-zero coefficients by cross-validation, i.e. with LassoCV.

#importing libraries
# We will be selecting features using the above listed methods for the regression problem of predicting the “MEDV” column.
%matplotlib inline
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso
 
# Loading the dataset
# Note: load_boston was removed in scikit-learn 1.2, so this example requires scikit-learn < 1.2
x = load_boston()
df = pd.DataFrame(x.data, columns = x.feature_names)
df["MEDV"] = x.target
X = df.drop("MEDV", axis=1)   #Feature Matrix
y = df["MEDV"]          #Target Variable
 
reg = LassoCV()
reg.fit(X, y)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" %reg.score(X,y))
coef = pd.Series(reg.coef_, index = X.columns)
Best alpha using built-in LassoCV: 0.724820
Best score using built-in LassoCV: 0.702444
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " +  str(sum(coef == 0)) + " variables")
Lasso picked 10 variables and eliminated the other 3 variables
imp_coef = coef.sort_values()
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")

[Figure: horizontal bar chart of the Lasso coefficients, "Feature importance using Lasso Model"]

Decision Trees

Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features. Tree-based models calculate feature importance because they need to keep the best-performing features as close to the root of the tree as possible. Constructing a decision tree involves finding the most predictive feature to split on at each node. Feature importance in tree-based models is calculated based on the Gini index, entropy, or the chi-square value.

In the example below we construct an ExtraTreesClassifier for the Pima Indians onset of diabetes dataset.

from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
# feature extraction
model = ExtraTreesClassifier(n_estimators=10)
model.fit(X, Y)
print(model.feature_importances_)
[0.111 0.238 0.098 0.079 0.076 0.145 0.118 0.135]

You can see that we are given an importance score for each attribute: the larger the score, the more important the attribute. The scores suggest the importance of plas, age and mass.

Linear Support Vector Classification (LinearSVC)

Selecting From Model

SelectFromModel is a meta-transformer that can be used alongside any estimator that assigns importance to each feature through a specific attribute (such as coef_ or feature_importances_) or via an importance_getter callable after fitting. The features are considered unimportant and are removed if the corresponding importance of the feature values is below the provided threshold parameter. Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument. Available heuristics are "mean", "median" and float multiples of these, such as "0.1*mean". In combination with the threshold criteria, one can use the max_features parameter to set a limit on the number of features to select.
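As a minimal, hedged sketch of these options (the estimator and the threshold below are illustrative choices, not requirements of SelectFromModel), the string heuristic and max_features can be combined like this:

# keep at most 5 features whose importance is above the mean importance
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

X, y = load_diabetes(return_X_y=True)
selector = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=0),
                           threshold="mean", max_features=5)
X_reduced = selector.fit_transform(X, y)
print('Kept columns:', np.flatnonzero(selector.get_support()))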

L1-based feature selection

Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero. When the goal is to reduce the dimensionality of the data to use with another classifier, they can be used along with SelectFromModel to select the non-zero coefficients. In particular, sparse estimators useful for this purpose are:

  • Lasso for regression
  • LogisticRegression for classification
  • LinearSVC for classification
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
X, y = load_iris(return_X_y=True)
print(f'Initial X Shape: {X.shape}')
 
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print(f'New X Shape: {X_new.shape}')
Initial X Shape: (150, 4)
New X Shape: (150, 3)

With SVMs and logistic regression, the parameter C controls the sparsity: the smaller the C, the fewer features selected. With Lasso, the higher the alpha parameter, the fewer features selected.
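As a quick illustration of this relationship (reusing the iris data from above; the grid of C values is an arbitrary choice), we can count the features with non-zero coefficients for several values of C:

# count surviving features for a few values of C (smaller C -> sparser model)
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
for C in (0.001, 0.01, 0.1, 1.0):
    lsvc = LinearSVC(C=C, penalty="l1", dual=False, max_iter=10000).fit(X, y)
    n_kept = np.sum(np.any(np.abs(lsvc.coef_) > 1e-10, axis=0))
    print(f'C={C}: {n_kept} features with non-zero coefficients')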

Genetic Algorithms

This section was informed mostly by this post. I am using a genetic algorithm for feature selection, but a genetic algorithm can also be used for hyper-parameter optimization. Because the steps are pretty straightforward and generalized, it applies to many different areas.

Genetic Algorithm

Selecting features is an NP-hard problem: given a set of features, the optimal configuration is some subset of that set. The selection is discrete, and with 2^n possible subsets of n features it is very costly to determine the optimal feature set exhaustively.

Genetic algorithms use an approach to determine an optimal set based on evolution. For feature selection, the first step is to generate a population based on subsets of the possible features.

From this population, the subsets are evaluated using a predictive model for the target task. Once each member of the population is considered, a tournament is performed to determine which subsets will continue into the next generation.

The next generation is composed of the tournament winners, with some cross over (update the winning feature sets with features from the other winners) and mutation (introduce or remove some features at random).

  1. An initial population is produced.
  2. A score is attached to the members of the population.
  3. A subset is selected for reproduction with a tournament.
  4. Select genetic material to pass on.
  5. Apply mutations.
  6. Repeat over multiple generations.

The algorithm runs for a set number of generations (iterations), after which the best member of the population yields the selected features.
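To make steps 1-6 concrete, here is a minimal, self-contained sketch of that loop (this is not the sklearn-genetic implementation used below; the population size, rates and scoring model are arbitrary choices for illustration):

# a toy genetic algorithm for feature selection: binary masks, tournament selection,
# uniform crossover and bit-flip mutation
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n_features = X.shape[1]

def fitness(mask):
    # score a feature subset by 5-fold CV accuracy; empty subsets score 0
    if not mask.any():
        return 0.0
    return cross_val_score(DecisionTreeClassifier(random_state=0), X[:, mask], y, cv=5).mean()

# 1. initial population of random feature subsets
population = rng.random((20, n_features)) < 0.2
for generation in range(10):
    # 2. score every member
    scores = np.array([fitness(ind) for ind in population])
    children = []
    while len(children) < len(population):
        # 3. tournament selection of two parents (best of 3 random members each)
        parents = [max(rng.choice(len(population), 3, replace=False),
                       key=lambda i: scores[i]) for _ in range(2)]
        # 4. uniform crossover of the parents' feature masks
        swap = rng.random(n_features) < 0.5
        child = np.where(swap, population[parents[0]], population[parents[1]])
        # 5. mutation: flip a few feature bits at random
        child ^= rng.random(n_features) < 0.04
        children.append(child)
    # 6. the children form the next generation
    population = np.array(children)

best = population[np.argmax([fitness(ind) for ind in population])]
print('Selected feature indices:', np.flatnonzero(best))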

The experiments are based on the UCI breast cancer dataset, which contains 569 instances and 30 features. With this dataset, I test several classifiers using all of the features, using the subset of features found by the genetic algorithm, and using five features selected with the chi-squared test for comparison. Below is the code used to select up to five features with a genetic algorithm. After installing the sklearn-genetic package, the code is straightforward.

from sklearn.datasets import load_breast_cancer
from genetic_selection import GeneticSelectionCV
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
X = df.drop(['target'], axis=1)
y = df['target'].astype(float)
estimator = DecisionTreeClassifier()
model = GeneticSelectionCV(
    estimator, cv=5, verbose=0,
    scoring="accuracy", max_features=5,
    n_population=100, crossover_proba=0.5,
    mutation_proba=0.2, n_generations=50,
    crossover_independent_proba=0.5,
    mutation_independent_proba=0.04,
    tournament_size=3, n_gen_no_change=10,
    caching=True, n_jobs=-1)
model = model.fit(X, y)
print('Features:', X.columns[model.support_])
Features: Index(['mean texture', 'mean concavity', 'worst radius', 'worst concave points'],
          dtype='object')

GeneticSelectionCV

The initial population (of size n_population) is generated at random from the sample space of feature sets. These sets are limited in scope by the parameter max_features, which sets the maximum size of each feature subset. For each member of the initial population, a score is measured with the target metric. This measurement is the performance of the estimator specified.

A tournament selection is performed to determine which members will continue to the next generation. The number of members within each tournament is set with tournament_size: a tournament draws a few members from the population, who compete against one another based on the scoring metric, and the winner is chosen as a parent for the next generation.

The number of members in the tournament should remain small. When the value is quite large, the current best member is almost always selected, so the weaker members are never chosen. While this provides temporary performance gains, it ultimately leads to reduced overall performance, as the weaker options are not given a chance to improve.
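A toy simulation makes the point (the fitness scores here are purely made up): the larger the tournament, the closer the average winner is to the current best member, which crowds out the weaker candidates.

# simulate tournaments of different sizes over a population of 100 random fitness scores
import numpy as np

rng = np.random.default_rng(0)
fitness = rng.random(100)
for k in (2, 3, 10, 50):
    winners = [fitness[rng.choice(100, k, replace=False)].max() for _ in range(1000)]
    print(f'tournament_size={k}: mean winner fitness = {np.mean(winners):.2f}')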