Naïve Bayes
The Naïve Bayes classifier stands out due to its basis in probability theory and its efficiency in categorical prediction tasks. One of the most prevalent applications is classifying emails as either spam or non-spam.
The Core: Bayes' Theorem
In machine learning, and particularly in spam classification, Bayes' Theorem plays a pivotal role. Presented mathematically as:

P(A|B) = P(A ∩ B) / P(B)
The theorem assesses the probability of event A occurring given that B is true, calculated by taking the probability of A and B occurring together and dividing it by the likelihood of B occurring. In the context of spam classification:

P(Spam|Free) = P(Spam ∩ Free) / P(Free)
This helps calculate how likely an email is spam, given that it contains certain words, like "free". The numerator can be thought of as the probability of a message being spam and containing the word "free"; the denominator is simply the overall probability of a message containing the word "free". Moreover, to uncover the elusive P(Spam ∩ Free), if it isn't readily available, you can derive it:

P(Spam ∩ Free) = P(Free|Spam) × P(Spam)
An email is a combination of various words, not merely "free". So, while determining whether it's spam, our model, trained on numerous words, multiplies the per-word probabilities together to get the overall probability of that email being spam. This is where the "naive" aspect enters the scene: the model presumes each word impacts the spamminess independently, without considering word pairs or word order.
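To make that multiplication concrete, here is a minimal sketch with made-up numbers; the per-word likelihoods, the prior, and the naive_score helper are purely illustrative assumptions, not anything produced by the classifier we build below:

from functools import reduce

# Hypothetical likelihoods P(word | Spam), as if estimated from training data
word_given_spam = {'free': 0.30, 'viagra': 0.20, 'golf': 0.01}
p_spam = 0.40  # hypothetical prior probability that any message is spam

def naive_score(words):
    # Multiply the prior by each word's likelihood, assuming independence
    likelihoods = [word_given_spam.get(w, 0.001) for w in words]  # small default for unseen words
    return p_spam * reduce(lambda a, b: a * b, likelihoods, 1.0)

print(naive_score(['free', 'viagra']))  # 0.4 * 0.30 * 0.20 = 0.024

A real classifier compares this score against the equivalent score for ham and picks the larger, typically working with log-probabilities to avoid numerical underflow.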
Implementing a spam classifier with Naïve Bayes
Creating a spam classifier is surprisingly straightforward in Python. Although data preparation often claims the majority of the effort, the actual machine learning can be achieved with a couple of lines of code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
Using CountVectorizer, we dismantle emails into words, counting their occurrences, while MultinomialNB empowers us with the Naive Bayes functionality:
# Importing and reading files, manipulating data...
# ... resulting in a DataFrame 'data' with email messages and their classification (spam/ham).
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)
classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)
In essence, the code above converts emails into a sparse matrix of word counts (counts) and employs the Naive Bayes classifier to learn from this data, associating word patterns with the spam or non-spam (ham) categories.
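If you're curious what the classifier actually sees, you can inspect the matrix and the learned vocabulary; note that get_feature_names_out is the scikit-learn 1.0+ name (older versions call it get_feature_names):

print(counts.shape)  # (number of emails, size of the vocabulary)
print(vectorizer.get_feature_names_out()[:10])  # a few of the words the vectorizer learned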
Once trained, the classifier is ready to predict new emails’ categories:
examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions
We first transform the messages into the same format the model was trained on, representing those words as values in a sparse matrix. Then, we use the predict() function on the classifier to see what we come up with:
array(['spam', 'ham'],
dtype='<U4')
And sure enough, it works! Given the two input messages, "Free Viagra now!!!" and "Hi Bob, how about a game of golf tomorrow?", it's telling me that the first came back as spam and the second came back as ham, which is what I would expect.
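If you'd rather see how confident the model is than just the final labels, MultinomialNB also provides predict_proba; the exact probabilities it prints will depend on your training data:

probabilities = classifier.predict_proba(example_counts)
print(classifier.classes_)  # the column order of the probabilities, e.g. ['ham' 'spam']
print(probabilities)  # one row per message with P(ham) and P(spam)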
It's crucial to acknowledge the trade-offs: this naive model ignores relationships between words, such as phrases and word order, which more advanced models can capture. But for a relatively simple and computationally inexpensive approach, it offers quite the bang for the buck in many applications!