K-Means Clustering

K-Means clustering is a machine learning technique for grouping a dataset into a chosen number of clusters. Imagine you own an online shop and you want to understand your customer base: K-Means can separate your customers into groups based on their shopping habits, age, or location. The algorithm works well for datasets where you can't easily see the patterns, especially higher-dimensional data where you cannot directly identify structure by plotting it.

Furthermore, even when we can identify structure in data by plotting it, we often want to describe the structure in a way that we can make further use of it.

Applications

This is an unsupervised learning technique: the algorithm looks at the inherent structure of your data without you telling it how to categorize each point. Clusters are formed based on the attributes of the data alone. The goal is to uncover unknown latent categories within your data, making it especially useful for market segmentation, social network analysis, and much more.

Say you run an online shopping website. You have loads of customer data: browsing history, purchase records, age, location, you name it. Viewing all of this as one big mass won't tell you much. But what if you could split your customers into meaningful groups? With clustering, you can. Maybe you discover one group is mostly college students in London who love outdoor gear, and another is retirees in Manchester looking for home decor. Knowing this makes your marketing more targeted and far more effective.

The site is also unlikely to have a huge amount of purchasing data for every single one of its customers. By grouping similar customers together, the site can draw on the pooled data of a whole group to predict which products a given customer is likely to buy.
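As a rough illustration of that idea, here is a hypothetical sketch (the feature names and numbers are made up, not from the original example) of assigning a new customer to an existing cluster so the group's pooled data can inform predictions:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [age, monthly spend] -- purely illustrative.
customers = np.array([[22, 40], [25, 35], [30, 50],
                      [58, 130], [61, 120], [67, 150]])
model = KMeans(n_clusters=2, n_init=10).fit(customers)

# A new customer with little purchase history still lands in some cluster...
new_customer = np.array([[24, 38]])
cluster = model.predict(new_customer)[0]

# ...so recommendations can draw on everything the cluster's members bought.
peers = customers[model.labels_ == cluster]
print(f"New customer joins cluster {cluster} with {len(peers)} similar customers")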

How Does It Work?

The K-Means algorithm follows a series of steps:

  1. Pick initial centroids: Randomly pick K points in your data. These serve as the initial centers of your clusters.
  2. Assign data points: Each point is assigned to the nearest centroid and becomes a member of that cluster.
  3. Recalculate centroids: Each cluster's centroid is updated to the mean position of the points assigned to it. That is, for each cluster we end up with, we move its centroid to the actual center of all its points.
  4. Repeat: Steps 2 and 3 are repeated until the centroids no longer change.

Each iteration of these steps reduces the within-cluster variation (i.e., it readjusts the clusters so that they're more compact) and, eventually, the clusters settle into a stable configuration: neither the labels nor the centroids change any further. A few steps of this process are illustrated in the figure below (using K = 4).

Which group a given data point belongs to is determined by which of these centroids it's closest to in feature space.
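To make steps 1-4 concrete, here is a minimal from-scratch sketch in NumPy (not the scikit-learn implementation used later; it also ignores edge cases such as a cluster losing all of its points):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (this assumes no cluster ever ends up empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Tiny demo on random 2-D data
X = np.random.default_rng(1).normal(size=(100, 2))
labels, centroids = kmeans(X, k=4)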

⚠️

The number of clusters (n_clusters) is an optional parameter of scikit-learn's sklearn.cluster.KMeans class. However, this implementation does not cleverly select the number of clusters based on properties of the data; it simply uses a fixed default value of 8 clusters.
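You can see that default for yourself; assuming a reasonably recent scikit-learn version, this prints 8:

from sklearn.cluster import KMeans

km = KMeans()          # n_clusters left unspecified
print(km.n_clusters)   # 8 -- a fixed default, not inferred from the data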

Restrictions & Limitations

K-means clustering cannot be applied to all data sets, and even when it can be applied, it may not give great results:

  1. Data type: K-Means requires numerical data to calculate distances. There are alternative clustering methods that can deal with categorical or mixed data types, such as K-modes or K-medoids with the Gower metric.
  2. Number of clusters (K): You need to specify the number of clusters in advance, which might not always be evident. Often, K is chosen using domain expertise, but you can also get an idea of it through plotting and data exploration (see the elbow-method sketch after this list).
  3. Limited scope: K-Means works well for spherical, roughly equally sized clusters, but it won't capture more complex patterns; i.e., there are numerical datasets that are simply not well-suited to K-Means.
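As for point 2, a common plotting-based heuristic is the elbow method: run K-Means for a range of K values and look for the point where the within-cluster variation stops dropping sharply. A quick sketch, using stand-in random data rather than the shop example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # stand-in dataset
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("K")
plt.ylabel("Within-cluster sum of squares (inertia)")
plt.show()  # look for the 'elbow' where the curve flattens out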

There are also some general limitations to k-means clustering:

  1. Choosing K: Picking the right number of clusters is crucial, and there's no definitive guide for it.
  2. Local Minima: The initial random choice of centroids can affect the outcome, so it's good to run the algorithm several times with different starting centroids and keep the best result (see the sketch after this list).
  3. No Labels: K-Means can tell you which items are similar but won't tell you how they are similar.
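For the local-minima issue, scikit-learn's KMeans already exposes an n_init parameter that reruns the algorithm from several random initializations and keeps the run with the lowest within-cluster variation. A small sketch of the effect, again on stand-in data:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # stand-in dataset
one_try = KMeans(n_clusters=5, n_init=1, random_state=0).fit(X)
best_of_ten = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# inertia_ is the within-cluster sum of squares; the best-of-ten run
# typically ends at least as low as any single initialization.
print(one_try.inertia_, best_of_ten.inertia_)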

Python Example

Let's see just how easy it is to do k-means clustering using scikit-learn and Python. The first step is to create some random test data with clusters built into it (a small helper takes care of that); actually running k-means on it is then incredibly easy.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
import numpy as np

# A minimal stand-in for the createClusteredData helper used below
# (assumed here to scatter N points around k random income/age centroids).
def createClusteredData(N, k):
    X = []
    for i in range(k):
        income_centroid = np.random.uniform(20000.0, 200000.0)
        age_centroid = np.random.uniform(20.0, 70.0)
        for j in range(N // k):
            X.append([np.random.normal(income_centroid, 10000.0),
                      np.random.normal(age_centroid, 2.0)])
    return np.array(X)

data = createClusteredData(100, 5)
model = KMeans(n_clusters=5)
# Note I'm scaling the data to normalize it! Important for good results.
model = model.fit(scale(data))
# We can look at the clusters each data point was assigned to
print(model.labels_)
# And we'll visualize it:
plt.figure(figsize=(8, 6))
plt.scatter(data[:,0], data[:,1], c=model.labels_.astype(float))
plt.show()

All you need to do is import KMeans from scikit-learn's cluster package. Once we've called fit on our model, we can look at the resulting labels.

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

It didn't take long, and it came up with some reasonable guesses at the clusters. This is probably an example where four clusters would fit the data more naturally.
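One way to check that hunch (a follow-up sketch, not part of the original example) is to compare silhouette scores for K = 4 and K = 5 on the same scaled data; a higher score suggests a better-separated clustering:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import scale

scaled = scale(data)  # 'data' from the example above
for k in (4, 5):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(scaled)
    print(k, silhouette_score(scaled, labels))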

Takeaway

K-Means is a powerful yet easy-to-use tool for quick, insightful analysis of clustering problems. Its versatility makes it a go-to algorithm for many applications, but remember: the quality of the results depends heavily on the nature of the data and the choice of K.