Dimensionality Reduction & Feature Extraction
Dimensionality reduction transforms high-dimensional data into a lower-dimensional form, facilitating both visualization and computational efficiency. Prominent techniques include Principal Component Analysis (PCA), widely used because it preserves maximal variance during the reduction; Linear Discriminant Analysis (LDA), often used for feature extraction in supervised learning; and t-SNE, typically used to visualize data in two-dimensional scatter plots. Notably, dimensionality reduction is not solely a visualization tool; it is also used for data compression and feature extraction.
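As a quick orientation before the detailed example below, here is a minimal sketch (assuming scikit-learn and its bundled Iris dataset, which are not part of the segmentation example that follows) showing how each of the three techniques reduces the same feature matrix to two dimensions:
# Minimal sketch on the bundled Iris data, not the segmentation dataset used later.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE

X, y = load_iris(return_X_y=True)  # 150 samples x 4 features

X_pca = PCA(n_components=2).fit_transform(X)                            # unsupervised, variance-preserving
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised, class-separating
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)          # non-linear, visualization only

print(X_pca.shape, X_lda.shape, X_tsne.shape)  # each is (150, 2)
Note that PCA and LDA produce linear projections that can be reused on new data, whereas t-SNE is typically refit for each visualization.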
The Curse of Dimensionality (Why Bother with Dimensionality Reduction?)
The curse of dimensionality is a collection of phenomena describing how, as dimensionality increases, data become harder to manage and less effective to work with. At a high level, it stems from the fact that as dimensions (variables/features) are added to a data set, the average and minimum distances between points (records/observations) increase. It also drives exponential growth in the computational effort required to process and analyze the data.
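This distance effect is easy to verify empirically. Below is a minimal sketch (assuming only NumPy; the point count and dimensions are arbitrary choices for illustration) that draws random points in increasingly many dimensions and reports how the average pairwise distance grows:
# Minimal sketch: average pairwise distance between random points grows with dimension.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((100, d))                        # 100 uniform points in [0, 1]^d
    diffs = points[:, None, :] - points[None, :, :]      # pairwise difference vectors
    dists = np.linalg.norm(diffs, axis=-1)               # pairwise Euclidean distances
    mean_dist = dists[np.triu_indices(100, k=1)].mean()  # ignore self-distances
    print(f"d={d:4d}  mean pairwise distance = {mean_dist:.2f}")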
Making good predictions becomes more difficult as the distance between the known points and the unknown points increases. Additionally, some features in a data set may add little value or predictive power with respect to the target (dependent) variable. Such features do not improve the model; they only add noise and increase its computational load. Because of the curse of dimensionality, dimensionality reduction is often a critical component of analytic processes, particularly in applications with high-dimensional data, such as computer vision or signal processing.
Too many features can also be problematic because they lead to sparse data: every feature is a new dimension. Much of feature engineering is selecting the features most relevant to the problem at hand, and this is often where domain knowledge comes into play.
Unsupervised techniques can also be employed to distill many features into fewer ones. PCA is the classic example, and K-Means clustering can serve a similar purpose by replacing the original features with distances to a small number of cluster centers, as sketched below.
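As a brief illustration of the K-Means variant, the following is a minimal sketch (assuming scikit-learn and a synthetic feature matrix; the cluster count of 8 is an arbitrary choice) that replaces a wide feature matrix with a handful of cluster-distance features:
# Minimal sketch: use K-Means cluster distances as a compact feature representation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.random((500, 40))                 # synthetic data: 500 samples, 40 features

X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=8, n_init=10, random_state=42).fit(X_scaled)

# transform() returns each sample's distance to every cluster center,
# reducing 40 original features to 8 derived ones.
X_reduced = kmeans.transform(X_scaled)
print(X_reduced.shape)                    # (500, 8)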
Example: Implementing Feature Extraction Techniques
An example utilizing data available on GitHub illustrates the practical application of feature extraction techniques.
import warnings
import os
import time
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
# Ignore any warnings
warnings.filterwarnings('ignore')
# Set the display width for console output
DESIRED_WIDTH = 320
pd.set_option('display.width', DESIRED_WIDTH)
np.set_printoptions(linewidth=DESIRED_WIDTH)
pd.set_option('display.max_columns', 30)
INPUTTRAINFILE = "drive/MyDrive/data_extraction/segmentation.test"
INPUTTESTFILE = "drive/MyDrive/data_extraction/segmentation.data"
TRAINFILE = "drive/MyDrive/data_extraction/train.csv"
TESTFILE = "drive/MyDrive/data_extraction/test.csv"
ATTRIBUTES = None
We can apply various feature extraction techniques to the dataset. The helper functions defined below are used throughout this section.
def filldatasetfile(inputfile, outputfile):
    """
    Creates the CSV file from the raw segmentation file
    :param inputfile: path to the raw segmentation file
    :param outputfile: path of the CSV file to write
    :return:
    """
    global ATTRIBUTES
    nfirstlines = []
    with open(inputfile) as _inp, open(outputfile, "w") as out:
        # Skip the 5-line header; the 4th line holds the attribute names
        for i in range(5):
            if i == 3:
                ATTRIBUTES = ['LABELS'] + next(_inp).rstrip('\n').split(',')
            else:
                nfirstlines.append(next(_inp))
        for line in _inp:
            out.write(line)
def extractdata():
    """
    Extract and return the segmentation data in pandas dataframe format
    :return: train and test dataframes
    """
    np.random.seed(0)
    if os.path.exists(TRAINFILE):
        os.remove(TRAINFILE)
    if os.path.exists(TESTFILE):
        os.remove(TESTFILE)
    filldatasetfile(INPUTTRAINFILE, TRAINFILE)
    filldatasetfile(INPUTTESTFILE, TESTFILE)
    # Convert csv to pandas dataframe
    traindata = pd.read_csv(TRAINFILE, header=None)
    testdata = pd.read_csv(TESTFILE, header=None)
    traindata.columns = testdata.columns = ATTRIBUTES
    # Shuffle the dataframes
    traindata = traindata.sample(frac=1).reset_index(drop=True)
    testdata = testdata.sample(frac=1).reset_index(drop=True)
    return traindata, testdata
def preprocessdata(data):
    """
    Preprocess the data with StandardScaler and LabelEncoder
    :param data: input dataframe of training or test set
    :return: scaled features, encoded labels, feature columns, original labels
    """
    labels = data['LABELS']
    features = data.drop(['LABELS'], axis=1)
    columns = features.columns
    enc = LabelEncoder()
    enc.fit(labels)
    labels = enc.transform(labels)
    features = StandardScaler().fit_transform(features)
    return features, labels, columns, data['LABELS']
def applyrandomforest(trainX, testX, trainY, testY):
    """
    Apply Random Forest on the input dataset.
    """
    start = time.process_time()
    forest = RandomForestClassifier(n_estimators=700, max_features='sqrt', max_depth=15)
    forest.fit(trainX, trainY)
    print("Time Elapsed: %s secs" % (time.process_time() - start))
    prediction = forest.predict(testX)
    print("Classification Report after applying Random Forest: ")
    print("----------------------------------------------------")
    print(classification_report(testY, prediction))
def applypca1(trainX, testX, trainY, testY, columns):
    """
    Apply PCA on the dataset
    """
    # Fitting the PCA algorithm with our data
    pca = PCA()
    pca.fit(trainX)
    # Plotting the cumulative summation of the explained variance
    plt.figure()
    plt.plot(np.cumsum(pca.explained_variance_ratio_))
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance')  # for each component
    plt.title('Segmentation Dataset Explained Variance')
    plt.show(block=True)
def applypca2(trainX, testX, trainY, testY, columns):
    # Reduce to 12 principal components, then re-run Random Forest
    pca = PCA(n_components=12)
    pca.fit(trainX)
    trainX_pca = pca.transform(trainX)
    testX_pca = pca.transform(testX)
    applyrandomforest(trainX_pca, testX_pca, trainY, testY)
def applypca3(trainX, testX, trainY, testY, columns):
    # Visualizing the PCA coefficients using a heat map
    pca = PCA(n_components=12)
    pca.fit(trainX)
    plt.matshow(pca.components_, cmap='viridis')
    plt.yticks(range(12), range(12))
    plt.colorbar()
    plt.xticks(range(len(columns)), columns, rotation=60, ha='left')
    plt.xlabel('Feature')
    plt.ylabel('Principal Components')
    plt.show(block=True)
def applylda(trainX, testX, trainY, testY, actual_labels):
    lda = LinearDiscriminantAnalysis()
    lda.fit(trainX, trainY)
    # Plotting the cumulative summation of the explained variance
    plt.figure()
    plt.plot(np.cumsum(lda.explained_variance_ratio_))
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance')  # for each component
    plt.title('Segmentation Dataset Explained Variance')
    plt.show(block=True)
    lda = LinearDiscriminantAnalysis(n_components=5)
    lda.fit(trainX, trainY)
    trainX_lda = lda.transform(trainX)
    testX_lda = lda.transform(testX)
    # Plot pairwise relationships between the LDA components
    plt.figure(figsize=(10, 8), dpi=80)
    visualizedf = pd.DataFrame(trainX_lda, columns=['LDA1', 'LDA2', 'LDA3', 'LDA4', 'LDA5'])
    visualizedf = pd.concat([visualizedf, pd.DataFrame(actual_labels, columns=['LABELS'])], axis=1)
    print(visualizedf.sample(n=5))
    sns.pairplot(visualizedf, vars=visualizedf.columns[:-1], hue="LABELS", palette="husl")
    plt.show(block=True)
    applyrandomforest(trainX_lda, testX_lda, trainY, testY)
def applytsne1(trainX, trainY):
    # Baseline visualization: project onto the first two LDA components
    lda = LinearDiscriminantAnalysis(n_components=2)
    lda.fit(trainX, trainY)
    # Transform the data onto the first two discriminant components
    trainX_lda = lda.transform(trainX)
    colors = ["#476A2A", "#7851B8", "#BD3430", "#4A2D4E", "#875525",
              "#A83683", "#4E655E"]
    plt.figure(figsize=(10, 10))
    plt.xlim(trainX_lda[:, 0].min(), trainX_lda[:, 0].max())
    plt.ylim(trainX_lda[:, 1].min(), trainX_lda[:, 1].max())
    for i in range(len(trainX_lda)):
        # Plot each point as its class label text instead of using scatter
        plt.text(trainX_lda[i, 0], trainX_lda[i, 1], str(trainY[i]),
                 color=colors[trainY[i]], fontdict={'weight': 'bold', 'size': 9})
    plt.xlabel("LDA 0")
    plt.ylabel("LDA 1")
    plt.show(block=True)
def applytsne2(trainX, trainY):
    colors = ["#476A2A", "#7851B8", "#BD3430", "#4A2D4E", "#875525",
              "#A83683", "#4E655E"]
    # Apply t-SNE from manifold learning for better visualization
    tsne = TSNE(random_state=21)
    # Use fit_transform instead of fit, as TSNE has no transform method
    trainX_tsne = tsne.fit_transform(trainX)
    plt.figure(figsize=(10, 10))
    plt.xlim(trainX_tsne[:, 0].min(), trainX_tsne[:, 0].max())
    plt.ylim(trainX_tsne[:, 1].min(), trainX_tsne[:, 1].max())
    for i in range(len(trainX_tsne)):
        # Plot each point as its class label text instead of using scatter
        plt.text(trainX_tsne[i, 0], trainX_tsne[i, 1], str(trainY[i]),
                 color=colors[trainY[i]], fontdict={'weight': 'bold', 'size': 9})
    plt.xlabel("t-SNE feature 0")
    plt.ylabel("t-SNE feature 1")
    plt.show(block=True)
The code fragment demonstrates initial data handling and extraction, featuring several functions that facilitate the application of feature extraction techniques, specifically:

filldatasetfile(inputfile, outputfile): Constructs a CSV file from the raw input data.
extractdata(): Retrieves and processes the segmentation data into DataFrames.
preprocessdata(data): Applies standard scaling and label encoding to the data.
applyrandomforest(trainX, testX, trainY, testY): Trains and evaluates a Random Forest classifier.

Subsequent functions, such as applypca1(trainX, testX, trainY, testY, columns) and applypca2(trainX, testX, trainY, testY, columns), demonstrate PCA application, while applylda(trainX, testX, trainY, testY, actual_labels) illustrates LDA usage; applytsne1(trainX, trainY) plots a two-component LDA projection for comparison, and applytsne2(trainX, trainY) showcases t-SNE visualization.
The data extraction step and a preview of the first few rows of TRAINDATA can be produced as follows:
TRAINDATA, TESTDATA = extractdata()
print(TRAINDATA.head(n=5))
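To tie the helper functions together, the following is a sketch of a possible driver; the exact call order is an assumption, since the original fragment does not show it:
# Assumed driver sketch (the original fragment does not show the call order);
# continues from the TRAINDATA and TESTDATA produced above.
trainX, trainY, columns, train_labels = preprocessdata(TRAINDATA)
testX, testY, _, _ = preprocessdata(TESTDATA)

applyrandomforest(trainX, testX, trainY, testY)       # baseline on all original features
applypca1(trainX, testX, trainY, testY, columns)      # cumulative explained-variance curve
applypca2(trainX, testX, trainY, testY, columns)      # Random Forest on 12 principal components
applypca3(trainX, testX, trainY, testY, columns)      # heat map of PCA loadings
applylda(trainX, testX, trainY, testY, train_labels)  # LDA projection, pair plot, Random Forest
applytsne1(trainX, trainY)                            # two-component LDA scatter for comparison
applytsne2(trainX, trainY)                            # t-SNE visualization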
In sum, dimensionality reduction and feature extraction techniques are pivotal for managing high-dimensional data, providing actionable insights, visualizations, and computational efficiency across a range of machine learning applications.