
Decision Trees and Random Forests

What is a Decision Tree?

A decision tree is a powerful machine learning tool that resembles a flowchart. It helps you make decisions by evaluating multiple attributes or variables. For example, let's say the coffee machine is not working: how would we fix it? A decision tree would look at conditions such as power, beans, and warning lights, and guide you through a series of questions to reach a decision.

The way it works is pretty simple. Given a dataset - let's say, historical coffee machine data and what the issue was - decision trees generate a flowchart. The flowchart consists of nodes representing decisions, and by answering those, you can arrive at a conclusion. Imagine the flowchart as a series of if-then-else questions.
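To make the flowchart analogy concrete, here is a toy sketch of the coffee machine example written as nested if/else checks; the conditions and suggestions are invented purely for illustration:

def diagnose_coffee_machine(has_power, has_beans, warning_light):
    # Each question corresponds to one node in the flowchart.
    if not has_power:
        return "Plug the machine in or check the fuse"
    if warning_light:
        return "Look up the warning code in the manual"
    if not has_beans:
        return "Refill the beans"
    return "Descale the machine or call a technician"

print(diagnose_coffee_machine(has_power=True, has_beans=False, warning_light=False))
# Refill the beans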

Decision Trees for Resume Filtering

Imagine you're swamped with resumes and want an efficient way to sort out the best candidates. You can train a decision tree based on historical data of previously hired candidates. The tree would consider variables such as years of experience, level of education, and whether the candidate did an internship, among others.

Using a hypothetical dataset, a trained decision tree might look like this:

Walking through this tree can efficiently narrow down suitable candidates for an interview, making the hiring process quicker and more reliable.

How Do Decision Trees Work?

The underlying algorithm of a decision tree is surprisingly straightforward. At each node, the algorithm picks the attribute that minimizes entropy, which is essentially a measure of disorder or randomness. The goal is to arrive at leaf nodes where the outcome is as homogeneous as possible (either all 'yes' or all 'no'). The commonly used algorithm for this is called ID3 (Iterative Dichotomiser 3).

ID3 is what's known as a greedy algorithm: as it builds the tree, it picks the attribute that minimizes entropy at that point. That may not produce an optimal tree, one that minimizes the number of decisions you have to make, but it will produce a tree that works, given the input data.
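As a rough sketch of the calculation that ID3 repeats at each node, the snippet below computes entropy and the information gain of a split with pandas; the toy DataFrame and its column names are invented for illustration:

import numpy as np
import pandas as pd

def entropy(labels):
    # Shannon entropy of a column of class labels; 0 means perfectly pure.
    probs = labels.value_counts(normalize=True)
    return -(probs * np.log2(probs)).sum()

def information_gain(df, attribute, target):
    # How much splitting on `attribute` reduces the entropy of `target`.
    before = entropy(df[target])
    after = sum(len(subset) / len(df) * entropy(subset[target])
                for _, subset in df.groupby(attribute))
    return before - after

toy = pd.DataFrame({'Interned': ['Y', 'Y', 'N', 'N'],
                    'Hired':    ['Y', 'Y', 'N', 'Y']})
print(information_gain(toy, 'Interned', 'Hired'))  # ~0.31 bits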

What are Random Forests?

While decision trees are handy, they are prone to overfitting; they can perform very well on the data they were trained on but fail on new, unseen data. That's where Random Forests come in. This technique constructs multiple decision trees, each trained on a random subset of the data. Each tree gets a vote on the final decision, creating a more robust model.

The technique of randomly resampling our data and training the same kind of model on each sample is called bootstrap aggregating, or bagging. This is a form of ensemble learning.

Random Forests also introduce another layer of randomness: each tree is restricted to a random subset of the attributes at each decision point while it is trying to minimize entropy. This variety from tree to tree makes the overall model more well-rounded and less prone to overfitting.
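In scikit-learn, both sources of randomness correspond to constructor parameters. A minimal sketch, assuming the default settings are otherwise acceptable:

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,    # number of trees in the forest
    bootstrap=True,      # bagging: each tree trains on a random resample of the rows
    max_features='sqrt'  # each split considers only a random subset of the attributes
)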

Decision Trees: Predicting Hiring Decisions Using Python

Creating a decision tree to predict hiring decisions is surprisingly simple in Python, thanks to pandas and scikit-learn. The example uses a CSV file, where the final column indicates whether the candidate received a job offer.

Years Experience | Employed? | Previous employers | Level of Education | Top-tier school | Interned | Hired
10               | Y         | 4                  | BS                 | N               | N        | Y
0                | N         | 0                  | BS                 | Y               | Y        | Y
7                | N         | 6                  | BS                 | N               | N        | N
2                | Y         | 1                  | MS                 | Y               | N        | Y
20               | N         | 2                  | PhD                | Y               | N        | N
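
Assuming the table above is saved as a CSV file (the filename below is just a placeholder), it can be loaded into a pandas DataFrame:

import pandas as pd

df = pd.read_csv('hiring_data.csv')  # placeholder path for the hiring data
print(df.head())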

Before running machine learning algorithms, the data needs some pre-processing. For instance, scikit-learn requires numerical values. To do this, we map textual labels like 'Y' and 'N' or educational levels like 'BS', 'MS', and 'PhD' to numerical values.

df['Hired'] = df['Hired'].map({'Y': 1, 'N': 0})
df['Level of Education'] = df['Level of Education'].map({'BS': 0, 'MS': 1, 'PhD': 2})
# ... and so on for other columns
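
One way to finish the mapping for the remaining yes/no columns, assuming the column names match the table above:

yes_no = {'Y': 1, 'N': 0}
for col in ['Employed?', 'Top-tier school', 'Interned']:
    df[col] = df[col].map(yes_no)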

We isolate the columns that contain features (attributes) and the column that contains the target (hiring decision).

features = list(df.columns[:6])
y = df["Hired"]
X = df[features]

Finally, we train the decision tree model using the following code:

from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)

The trained model can be visualized:

from IPython.display import Image
from io import StringIO  # sklearn.externals.six has been removed from recent scikit-learn releases
import pydotplus

dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data,
                     feature_names=features)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
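
If graphviz and pydotplus are not available, newer versions of scikit-learn can also print the same tree as indented text, which is one way to inspect the splits without any plotting dependencies:

from sklearn.tree import export_text

# Each indented line is one question in the flowchart.
print(export_text(clf, feature_names=features))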

How to Read the Tree

The decision tree presents a flowchart where each node represents a decision based on one of the features. It asks questions like, "Is the candidate currently employed?" Depending on the answer, you follow the tree to the next node, eventually reaching an end node (leaf) that tells you the likely hiring decision.

The gini score at each node indicates how mixed the samples at that point are - lower is better, aiming for zero, which would mean all samples are of one kind.

For instance, in this example, if a person is currently employed (Employed? = 1), they are likely to receive a job offer. On the other hand, if they are not currently employed but have had an internship, they are also likely to receive a job offer.

This provides a practical, easy-to-understand way to predict hiring decisions based on various factors, and it takes only a few lines of Python to implement.

Ensemble Learning: Random Forest

Now, let's say we want to use a random forest. To do this, you can create a Random Forest model with 10 trees using the same feature (X) and target (y) data you have.

from sklearn.ensemble import RandomForestClassifier
 
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X, y)
 
# Predict hiring decisions for two hypothetical candidates
print(clf.predict([[10, 1, 4, 0, 0, 0]]))  # Employed 10-year veteran
print(clf.predict([[10, 0, 4, 0, 0, 0]]))  # Unemployed 10-year veteran

The output would look something like this:

[1]
[0]

What we did here is:

  • RandomForestClassifier(n_estimators=10) sets up a forest with 10 decision trees.
  • clf.fit(X, y) trains the model with your data.
  • clf.predict(...) lets you make predictions. You don't need to manually go through each tree; the model handles this complexity for you.

Here's the twist: Random Forests introduce an element of randomness, so you may not get the same result every time you retrain the model and rerun the prediction. This randomness comes from "bagging" (bootstrap aggregating) the data for each tree. While this generally improves your model, too few trees can make your predictions inconsistent.

In essence, Random Forests offer a powerful and simple way to improve your model's performance, but make sure to choose an appropriate number of trees to maintain consistent predictions.
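
If you need reproducible predictions, one option (not part of the original example) is to use more trees and fix the random seed; the specific values here are arbitrary:

from sklearn.ensemble import RandomForestClassifier

# More trees plus a fixed seed make the forest deterministic across retraining runs.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf = clf.fit(X, y)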