Support Vector Machines (SVM)

Support Vector Machines (SVM) is a powerful supervised learning method for classifying high-dimensional data. Unlike unsupervised methods such as k-means clustering, SVM learns from labeled examples, and it copes well with data that has many features. SVM works by finding a hyperplane in the feature space that divides the data into separate classes. That hyperplane is defined by "support vectors" - the data points closest to the decision boundary. When a straight separator won't do, SVM uses mathematical techniques such as the "kernel trick" to find good separators in higher-dimensional spaces, and different kernels can be chosen depending on the problem you're tackling.

The "kernel trick" is an important part of how SVM works. While the underlying math is complex, in summary the kernel trick helps the algorithm efficiently find hyperplanes in high dimensions. Different types of kernels (linear, polynomial, etc.) can give different results and have varying computational costs. Choose wisely based on the data.

The advantages of support vector machines are:

  • Effective in high dimensional spaces.
  • Still effective in cases where the number of dimensions is greater than the number of samples.
  • Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
  • With a soft margin (the C parameter), SVM is relatively robust to outliers.
  • Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

  • If the number of features is much greater than the number of samples, avoiding over-fitting through careful choice of the kernel function and the regularization term is crucial.
  • Choosing a good kernel is not easy.
  • The main SVM hyperparameters are the cost C and gamma. They are not easy to fine-tune, and their impact on the model is hard to visualize.
  • SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see the scikit-learn notes on scores and probabilities, and the snippet after this list).
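
To make the last point concrete, here is a brief sketch of how scikit-learn exposes probability estimates: setting probability=True triggers the internal cross-validation mentioned above, which makes fitting noticeably slower. The use of the Iris dataset here is just an assumption for illustration.

from sklearn import svm
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# probability=True enables probability estimates via internal cross-validation (slower to fit)
clf = svm.SVC(kernel='linear', probability=True).fit(X, y)

print(clf.predict(X[:1]))        # hard class label
print(clf.predict_proba(X[:1]))  # per-class probability estimates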

One example you often see with SVMs (including in the scikit-learn docs) is the classic Iris dataset. This dataset classifies different species of Iris flowers based on four features: the lengths and widths of their petals and sepals (the small support structures underneath the petals). Visualizing this multi-dimensional data can be simplified by projecting it down to two dimensions, such as sepal width and length.
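
To sketch that projection in code, here is one way to fit an SVM on just the two sepal features of the Iris dataset; the linear kernel and C=1.0 are assumptions chosen for illustration rather than tuned values.

from sklearn import svm, datasets

iris = datasets.load_iris()
X = iris.data[:, :2]  # keep only sepal length and sepal width
y = iris.target

# Fit a linear-kernel SVM on the 2-D projection of the data
clf = svm.SVC(kernel='linear', C=1.0).fit(X, y)
print(clf.score(X, y))  # training accuracy on the two sepal features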

The kernel trick allows for various forms of decision boundaries, from linear to polynomial to even more exotic curves. But watch out: more complex kernels demand more computational power and can lead you into the quagmire of overfitting. Always perform train/test splits to fine-tune your model, and possibly consider ensemble methods.

Finding the optimal kernel and its parameters is no trivial task; it's a delicate balance that requires some trial and error.
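
One common way to structure that trial and error is a cross-validated grid search over kernels and hyperparameters. The sketch below uses the Iris dataset and an illustrative search space; the parameter ranges are assumptions, not recommended defaults.

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Illustrative search space - tune the ranges for your own data
param_grid = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1, 10],
    'gamma': ['scale', 0.01, 0.1],
}
search = GridSearchCV(svm.SVC(), param_grid, cv=5).fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy on the held-out test split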

How SVM Works

How does SVM select a hyperplane? There can be an infinite number of hyperplanes that classify the classes perfectly, so which one is the best? SVM picks the hyperplane with the maximum margin, i.e. the greatest distance between the two classes. There are two types of SVM algorithms:

  1. Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be split into two classes by a single straight line (in 2D) or a plane (in 3D), it is termed linearly separable, and the classifier used is called a linear SVM classifier.
  2. Non-Linear SVM: When the data is not linearly separable, we can use non-linear SVM with techniques like the kernel trick to classify it. In most real-world applications, we do not find linearly separable data points.

Let's define two main terms that will come up again and again in this article:

  1. Support Vectors: These are the points closest to the hyperplane. The separating line is defined with the help of these data points (scikit-learn even exposes them on a fitted model, as shown in the snippet after this list).
  2. Margin: the distance between the hyperplane and the observations closest to it (the support vectors). In SVM, a large margin is considered a good margin. There are two types of margins: hard margin and soft margin.
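
For reference, here is a small sketch of how scikit-learn lets you inspect the support vectors of a fitted model; the six-point toy dataset is made up purely for illustration.

from sklearn import svm

# Toy 2-D data: two loose groups of points (illustration only)
X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

clf = svm.SVC(kernel='linear', C=1.0).fit(X, y)

print(clf.support_vectors_)  # coordinates of the support vectors
print(clf.n_support_)        # number of support vectors per class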

SVM Margin

Linear SVM is defined in terms of the support vectors only - we don’t have to worry about the other observations, since the margin is built from the points closest to the hyperplane (the support vectors). Suppose we have a dataset with two classes (green and blue), and we want to classify a new data point as either blue or green. To classify these points we can draw many decision boundaries (hyperplanes), but which one is the best, and how do we find it?

SVM Hyperplanes

The best hyperplane is the one that has the maximum distance from both classes. It is found by considering the hyperplanes that classify the labels correctly and choosing the one that is farthest from the data points - that is, the one with the maximum margin.
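
In conventional notation (a sketch using standard symbols this article hasn't introduced: w is the hyperplane's normal vector, b its offset, x_i the data points and y_i ∈ {-1, +1} their labels), the hard-margin problem is usually written as:

\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1 \;\; \text{for all } i

Maximizing the margin, which works out to 2/||w||, is the same as minimizing ||w||^2, which is why the objective takes this form.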

SVM Optimal Hyperplanes

In real-life applications we rarely find a dataset that is perfectly linearly separable; what we find is either an almost linearly separable dataset or a non-linearly separable one. This is where soft-margin SVM comes in: the optimisation objective gains a penalty for points that are misclassified or violate the margin, so that we can say SVM Error = Margin Error + Classification Error.
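
In the same notation as above (again a sketch, with slack variables ξ_i measuring how far each point violates the margin and C the cost parameter that appears in the code examples later on), the soft-margin objective is:

\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0

The first term is the margin error and the second is the classification error, weighted by C.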

Note on Kernels

The most interesting feature of SVM is that it can work even with a non-linear dataset, and for this we use the "kernel trick," which makes it easier to classify the points. Suppose we have a dataset like the one below - we cannot draw a single hyperplane that classifies the points correctly. So what we do is map this lower-dimensional space into a higher-dimensional space, which allows us to find a decision boundary that clearly divides the data points.

SVM Non-Linear Dataset
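
To make the idea of "adding a dimension" concrete, here is a small sketch; the explicit feature map below (the squared distance from the origin) is an assumption chosen for illustration - the kernel trick performs this kind of lifting implicitly.

import numpy as np
from sklearn import svm
from sklearn.datasets import make_circles

# Two concentric rings: not separable by a straight line in 2-D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Explicitly add a third feature: the squared distance from the origin
X_3d = np.column_stack([X, (X ** 2).sum(axis=1)])

# In the lifted 3-D space, a plain linear SVM separates the rings easily
clf = svm.SVC(kernel='linear').fit(X_3d, y)
print(clf.score(X_3d, y))  # close to 1.0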

The functions that perform this mapping implicitly are called kernels. Some common ones include:

  1. Polynomial kernel
  2. Sigmoid kernel
  3. RBF (radial basis function) kernel
  4. Bessel function kernel
  5. ANOVA kernel

Which kernel to use is determined largely by the kind of dataset you are working on. If it is linearly separable, opt for the linear kernel, since it is easy to use and its computational cost is much lower than that of the other kernel functions. It's recommended to start with the hypothesis that your data is linearly separable, choose a linear kernel, and then work your way up towards the more complex kernel functions. In practice, SVM is usually used with the RBF and linear kernels, because other kernels, such as the polynomial kernel, tend to be less efficient.
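
A quick way to sanity-check that choice is to compare cross-validated scores for a linear and an RBF kernel on your data. The sketch below uses the Iris dataset purely as a stand-in.

from sklearn import svm, datasets
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)

# Compare 5-fold cross-validation accuracy for two kernels
for kernel in ('linear', 'rbf'):
    scores = cross_val_score(svm.SVC(kernel=kernel), X, y, cv=5)
    print(kernel, scores.mean())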

Hands-on Example: Classifying People

For a practical example, let's consider a fabricated dataset containing the age and income of 100 people, clustered around 5 different points:

import numpy as np
 
# Create fake income/age clusters for N people in k clusters
def createClusteredData(N, k):
    pointsPerCluster = int(N / k)
    X = []
    y = []
    for i in range(k):
        # Pick a random centre for this cluster's income and age
        incomeCentroid = np.random.uniform(20000.0, 200000.0)
        ageCentroid = np.random.uniform(20.0, 70.0)
        for j in range(pointsPerCluster):
            # Scatter points around the centre with some Gaussian noise
            X.append([np.random.normal(incomeCentroid, 10000.0), np.random.normal(ageCentroid, 2.0)])
            y.append(i)  # the cluster index doubles as the class label
    X = np.array(X)
    y = np.array(y)
    return X, y

You can visualise those clusters with a scatter plot to see where they end up:

%matplotlib inline
import matplotlib.pyplot as plt
 
(X, y) = createClusteredData(100, 5)
 
plt.figure(figsize=(8, 6))
plt.scatter(X[:,0], X[:,1], c=y.astype(float))
plt.show()

To use linear SVC (SVC is a form of SVM), we're going to use SVM with a linear kernel and a C value of 1.0. C is just an error penalty term that you can adjust; it is 1.0 by default.

from sklearn import svm
 
C = 1.0
svc = svm.SVC(kernel='linear', C=C).fit(X, y)

To visualise the classification ranges and the SVC, we do the following. In simple terms, this creates a mesh across the entire grid, plots the different classifications from the SVC model as different colors on that grid, and then plots our original data on top of it:

def plotPredictions(clf):
    # Build a dense mesh covering the income (x) and age (y) ranges
    xx, yy = np.meshgrid(np.arange(0, 250000, 10),
                         np.arange(10, 70, 0.5))
    # Classify every point on the mesh
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
 
    plt.figure(figsize=(8, 6))
    Z = Z.reshape(xx.shape)
    # Shade the mesh by predicted class, then overlay the original data
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
    plt.scatter(X[:,0], X[:,1], c=y.astype(float))
    plt.show()
 
plotPredictions(svc)

SVC is computationally expensive, and here it also has to classify every point on a dense mesh, so this takes a while to run:

SVM with Scikit-learn

In the image, the shaded areas represent the clusters found by SVM. It does a good job, given that it had to draw straight lines, even if it's not perfect.
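
If you wanted curved boundaries instead of straight lines, one option - sketched here with the default gamma='scale', which is an assumption rather than a tuned value - is to swap in an RBF kernel and re-plot:

# Same data, but an RBF kernel allows curved decision boundaries
rbf_svc = svm.SVC(kernel='rbf', C=1.0, gamma='scale').fit(X, y)
plotPredictions(rbf_svc)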

Making Predictions

After training the model, making predictions is straightforward. For example, if you want to classify someone who makes $200,000 a year and is 40 years old:

print(svc.predict([[200000, 40]]))
# Output: [0]

In this example, the person would belong to cluster 0.
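
One caveat worth knowing: SVMs are sensitive to feature scale, and here income is in the hundreds of thousands while age is in the tens. A common remedy, sketched below with a scikit-learn pipeline (this setup is an addition for illustration, not part of the original example), is to standardize the features before fitting:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize income and age so neither feature dominates the margin calculation
scaled_svc = make_pipeline(StandardScaler(), svm.SVC(kernel='linear', C=1.0)).fit(X, y)
print(scaled_svc.predict([[200000, 40]]))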

Conclusion

SVM offers a robust way to handle complex, high-dimensional data. However, it can be computationally expensive, so be mindful when dealing with large datasets. Whether you're classifying flowers or predicting financial trends, SVM's adaptability makes it a tool worth having in your data science toolkit.

