K-Nearest Neighbors (KNN) Algorithm

The K-Nearest Neighbors (KNN) algorithm is a straightforward technique primarily used for classification and regression. It operates on the principle that similar data points are likely to have similar labels. This algorithm comes in handy when you have existing data that you've already classified and want to categorize new, unseen data points.

[Figure: KNN scatter plot]

In a scatter plot, for example, KNN would identify the closest training samples to a new data point 🟢 and let those samples "vote" on its classification. If you set K = 3, the algorithm considers the three nearest neighbors of the new point to determine its label, in this case 🔺. However, the choice of K can significantly affect the outcome: had we chosen K = 5, we would instead classify the new point as 🟦. Determining the optimal value of K therefore often involves techniques like train/test splits.
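
To make this concrete, here is a minimal sketch with made-up toy points (not the ones in the plot above) where the predicted label flips between K = 3 and K = 5:

from sklearn.neighbors import KNeighborsClassifier

# toy training data: two "triangle" points and three "square" points
X = [[1, 1], [2, 1], [3, 3], [3, 4], [4, 4]]
y = ["triangle", "triangle", "square", "square", "square"]
new_point = [[2.2, 2.2]]

for k in (3, 5):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, clf.predict(new_point))
# K = 3 -> ['triangle'], K = 5 -> ['square']: the choice of K flips the label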

KNN doesn't really "learn" from the training data; rather, it memorizes it. Therefore, it is called a non-parametric and "lazy" algorithm.

Distance Measures

How the algorithm calculates "nearness" depends on the distance metric used. Some common distance metrics are listed below (a short code sketch computing each one follows the list):

  1. Euclidean Distance: The straight-line distance between two points. Conceptually, it should be used when comparing observations with continuous features, like height, weight, or salary.
  2. Manhattan Distance: The sum of the absolute differences between the coordinates of a pair of points. Instead of measuring a straight line between two points, we 'walk' through the available paths. The Manhattan distance is useful when our observations are distributed along a grid, as in chess or city blocks (for example, when the features take whole-number values with no decimal parts).
  3. Cosine Similarity: Measures the cosine of the angle between two vectors, i.e. how similar their directions are regardless of magnitude.
  4. Hamming Distance: The number of positions at which two corresponding pieces of data differ. When comparing two vectors of equal length, the Hamming distance is the number of positions in which they differ. This metric is generally used when comparing text or binary vectors.

[Figure: Hamming distance]
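
As a quick illustration, all four metrics are available in SciPy's scipy.spatial.distance module; the two vectors below are made-up examples:

import numpy as np
from scipy.spatial import distance

a = np.array([1, 0, 2, 3])
b = np.array([4, 4, 2, 0])

print(distance.euclidean(a, b))  # straight-line distance
print(distance.cityblock(a, b))  # Manhattan ("city block") distance
print(distance.cosine(a, b))     # cosine distance, i.e. 1 - cosine similarity
print(distance.hamming(a, b))    # fraction of positions where a and b differ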

How does KNN work?

KNN performs classification or regression for a new example by calculating the distance between that example and every existing example in the dataset. The algorithm stores the entire dataset and makes predictions from this training, or "known", data only. After the user defines a distance function, like the ones mentioned earlier, KNN computes the distance from the new data point to each training point; the closest training points under that distance become the "k-neighbors". For a classification task, KNN predicts the most frequent label among the k-neighbors. For a regression task, it predicts the average of the k-neighbors' values.

All KNN does is store the complete dataset, and without doing any calculation or modeling on top of it, measure the distance between the new data point and its closest data points. For this reason, and since there’s not really a learning process happening, KNN is called a “lazy” algorithm (as opposed to “eager” algorithms like Decision Trees that build generalized models before performing predictions on new data points).
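
To see those mechanics end to end, here is a minimal from-scratch sketch of KNN classification; the data points and the knn_predict helper are hypothetical, purely for illustration:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every stored training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # labels of the k closest training points
    nearest_labels = y_train[np.argsort(dists)[:k]]
    # classification: majority vote among the k-neighbors
    return Counter(nearest_labels).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 5.0], [6.0, 5.5]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.2, 1.5]), k=3))  # -> 'A'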

Optimal Value of K

Choosing the right K is crucial. The Elbow method and Grid search are popular techniques for finding the optimal K.

Elbow method

The Elbow method is a technique used to find the optimal number of neighbors (K) for tasks like classification and regression. By testing different K values and measuring the resulting error on the training and test sets, one can identify the "elbow point" where the error rate stabilizes. This point represents the optimal K value.

[Figure: Elbow plot of error rate vs. K]

In the given example, the error rate becomes almost constant for K > 23, suggesting that K = 23 is the optimal value for this particular dataset.
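
A typical way to produce such a plot looks like the sketch below, assuming a feature matrix X and label vector y are already loaded (the range of K values tried is arbitrary):

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# X and y are assumed to be an already-loaded feature matrix and label vector
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

k_values = range(1, 40)
error_rate = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rate.append(1 - knn.score(X_test, y_test))  # misclassification rate on the test set

plt.plot(k_values, error_rate, marker="o")
plt.xlabel("K")
plt.ylabel("Error rate")
plt.show()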

Grid search

Another alternative for deciding the value of K is using grid search. Grid search exhaustively tries every candidate value of a hyperparameter (here, K) within a specified range and keeps the one that optimizes a chosen metric (here, the error rate). Tools like Python's GridSearchCV can automate fitting KNN on the training set while validating performance via cross-validation in order to identify the optimal value of K.
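
A minimal sketch, assuming the X_train and y_train from the earlier split, might look like this:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# try K = 1..39, scoring each candidate with 5-fold cross-validation
param_grid = {"n_neighbors": list(range(1, 40))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_)  # e.g. {'n_neighbors': 23}
print(grid.best_score_)   # mean cross-validated accuracy of the best K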

Implementing KNN in Python

K-Nearest Neighbors (KNN) can be quickly implemented in Python using Scikit-Learn. Here's a simple code snippet:

from sklearn.neighbors import KNeighborsClassifier

# `data` is a DataFrame whose first four columns are the flower measurements
# and whose 'Name' column holds the species label; `test` holds new measurements
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(data.iloc[:, 0:4], data['Name'])
print(neigh.predict(test))  # Output: ['Iris-virginica']

This code trains a KNN model to classify flower species based on their features; data is a DataFrame of the Iris measurements with a 'Name' label column, and test holds the measurements of the flowers to classify. Setting n_neighbors=3 means the model looks at the three nearest data points to make a prediction.
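
If you want a fully self-contained variant, Scikit-Learn also ships the Iris data directly; this version is ours, not the original snippet:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# load the built-in Iris data and hold out a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

neigh = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(neigh.score(X_test, y_test))  # accuracy on the held-out test set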

Anomaly Detection using KNN

While KNN is mostly known as a supervised learning algorithm, it can also work in an unsupervised manner for tasks like anomaly detection. In this case, we use Scikit-Learn's NearestNeighbors to measure distances between points and identify outliers. Say we created a model and plotted the mean of the distances to the K nearest neighbors:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# create the model using the 10 nearest neighbors
nbrs = NearestNeighbors(n_neighbors=10)
# fit the model on the feature DataFrame
nbrs.fit(df)
# distances and indexes of the k-neighbors of every point
distances, indexes = nbrs.kneighbors(df)
# mean distance from each point to its k-neighbors
distances_mean = pd.Series(distances.mean(axis=1))
# plot
plt.figure(figsize=(15, 7))
plt.plot(distances_mean)

[Figure: mean distance to the 10 nearest neighbors for each data point]

The graph above suggests we need a "threshold" to flag a data point as abnormal. You can set this threshold using basic statistics:

distances_mean.describe()

This will return:

count    856.000000
mean       7.351681
std        3.281417
min        3.141603
25%        5.342177
50%        6.584519
75%        8.359990
max       42.124777
dtype: float64

In the example, the 75th percentile of distances was about 8.36, so a threshold of 9.0 was chosen.
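
If you prefer not to pick the number by eye, a percentile-based cutoff works too; this rule of thumb is our assumption, not part of the original analysis:

# hypothetical alternative: flag the top 1% of mean distances
th = float(distances_mean.quantile(0.99))

In the walkthrough, the threshold is simply fixed by hand: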

import numpy as np

# keep the positions whose mean distance exceeds the threshold
th = 9.0
outlier_index = np.where(distances_mean > th)[0]
outlier_index

And we can visualize the anomalous points:

# select the rows flagged as outliers
outlier_values = df.iloc[outlier_index]
outlier_values
# plot the full stock series
plt.figure(figsize=(20, 7))
plt.plot(df["Date"], df["Open"], color="b")
# overlay the outlier values in red
plt.scatter(outlier_values["Date"], outlier_values["Open"], color="r")

[Figure: stock price series with KNN-flagged outliers in red]

Setting this threshold helped identify outliers in a stock graph. The red points in the graph mark unusual stock activity, either plunging or soaring. That's KNN for you: easy to implement and versatile enough for both classification and spotting outliers!

Conclusion

The KNN algorithm is one of the simplest classification algorithms. Even with such simplicity, it can give highly competitive results for both classification and outlier detection. The KNN algorithm can also be used for regression problems; the only difference from the methodology discussed is that the prediction is the average of the nearest neighbors' values rather than a majority vote among them.
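
For completeness, here is a minimal regression sketch with made-up one-dimensional data, using Scikit-Learn's KNeighborsRegressor:

from sklearn.neighbors import KNeighborsRegressor

# toy 1-D data: the prediction is the average target of the 3 nearest x values
X = [[1], [2], [3], [10], [11], [12]]
y = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

reg = KNeighborsRegressor(n_neighbors=3).fit(X, y)
print(reg.predict([[2.5]]))  # mean of the targets for x = 1, 2, 3 -> [2.]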

