
Entropy: Measuring Chaos in Data

As we saw in Decision Trees, it's crucial to grasp the concept of entropy. It quantifies how alike or different the items in your data are.

What is Entropy?

In a dataset, like a collection of animals sorted by species, low entropy means most animals are the same species. For instance, if you only have iguanas, that's low entropy—pretty orderly. High entropy means you've got a zoo on your hands: iguanas, pigs, sloths, and more. Things are chaotic and different.

Math Behind the Chaos

The entropy of a dataset $S$ is calculated using the Shannon entropy function:

$$H(S) = -p_1 \ln p_1 - \cdots - p_n \ln p_n$$

Here $p_i$ represents the proportion of the data that belongs to class $i$, like a species in our animal example. Say you have $n$ different classes; then $p_1, p_2, \ldots, p_n$ are the proportions for each class.
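
To make the formula concrete, here's a minimal Python sketch (the `entropy` helper and the sample species lists are just illustrative, not from the lesson) that computes $H(S)$ from a list of class labels:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a list of class labels."""
    n = len(labels)
    counts = Counter(labels)
    # Sum -p_i * ln(p_i) over every class present in the data.
    return sum(-(c / n) * math.log(c / n) for c in counts.values())

# Low entropy: every animal is the same species.
print(entropy(["iguana"] * 10))          # 0.0

# High entropy: an even three-way mix of species.
print(entropy(["iguana", "pig", "sloth"] * 2))  # ln(3) ≈ 1.0986
```

Classes that don't appear in the data simply contribute nothing to the sum, which matches the convention that $0 \ln 0 = 0$.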

Entropy Graph

The graph above shows what the individual terms $-p_i \ln p_i$ look like. When $p_i$ is either 0 or 1, the contribution to entropy is zero. Why? Because if a class is entirely absent or entirely present, it doesn't add any chaos. The juicy part is in the middle, where a mixture of classes occurs; that's what spikes up the entropy.
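
If you don't have the graph handy, a quick way to see the same shape is to evaluate a single term $-p \ln p$ at a few proportions (a tiny illustrative snippet, not part of the original lesson):

```python
import math

# Contribution of one class to the entropy, -p * ln(p), at a few proportions.
for p in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    term = 0.0 if p in (0.0, 1.0) else -p * math.log(p)
    print(f"p = {p:.2f}  ->  -p ln p = {term:.4f}")
```

The term vanishes at both ends and peaks in between (at $p = 1/e$ for the natural log), which is exactly the "juicy middle" described above.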

So, in essence, entropy is all about measuring the level of disorder in your dataset. It's the foundation for building decision trees and various other machine learning algorithms. Understanding entropy is like learning how to read a compass: it points you in the direction your data is taking, be it towards uniformity or chaos.