K-Fold Cross-Validation: Improving Train/Test to Avoid Overfitting

K-fold cross-validation is a powerful method to overcome the limitations of a simple train/test split when validating your machine learning model. It helps in creating a more reliable performance metric for your model.

In a basic train/test split, you divide your data into a training set and a testing set. While it provides an initial validation, it's not foolproof; you could still end up overfitting to your specific train/test split. K-fold cross-validation, however, divides your data into K subsets, trains on K-1 of them, tests on the remaining one, and repeats the process so that each subset gets a turn as the test set.

  1. Divide the data into K buckets.
  2. Reserve one of those buckets for testing purposes, for evaluating the results of the model.
  3. Train your model on the remaining K-1 buckets.
  4. Test the model on the reserved bucket.
  5. Repeat so that each bucket serves as the test set once, then average the test metrics from each fold for a final performance metric.

It is just a more robust way of doing train/test.
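To make those steps concrete, here is a minimal sketch of the same loop written by hand using scikit-learn's KFold on the Iris data used later in this section. The shuffle and random_state settings are illustrative choices, not part of the original example:

import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold

iris = datasets.load_iris()
X, y = iris.data, iris.target

# 5 buckets; each one takes a turn as the test set
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_index, test_index in kf.split(X):
    # Train on the K-1 remaining buckets
    model = svm.SVC(kernel='linear', C=1).fit(X[train_index], y[train_index])
    # Test on the held-out bucket
    fold_scores.append(model.score(X[test_index], y[test_index]))

# Average the per-fold scores for the final metric
print(np.mean(fold_scores))

This is exactly the bookkeeping that cross_val_score handles for you, as shown below.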

Practical Example Using Scikit-learn

Fortunately, scikit-learn makes this really easy to do. Say we wanted to decide which kernel works best for an SVC model, such as a simple linear kernel versus a more complex polynomial one. As a reminder, the Iris dataset contains a set of 150 Iris flower measurements, where each flower has a length and width of its petal, and a length and width of its sepal. We also know which of 3 different species of Iris each flower belongs to. The challenge here is to create a model that can successfully predict the species of an Iris flower, given just the length and width of its petal and sepal.
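If you want a quick look at that data yourself, a short inspection (the print statements here are just illustrative) confirms the 150 flowers, the four measurements per flower, and the three species:

from sklearn import datasets

iris = datasets.load_iris()
print(iris.data.shape)       # (150, 4): sepal length/width and petal length/width
print(iris.feature_names)    # the four measurement columns
print(iris.target_names)     # the 3 species: setosa, versicolor, virginica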

Single Train/Test Split

Revisiting SVC with the Iris dataset, a single conventional train/test split is made easy with the train_test_split function:

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import datasets
from sklearn import svm
 
iris = datasets.load_iris()
# Split the iris data into train/test data sets with 40% reserved for testing
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)
# Build an SVC model for predicting iris classifications using training data
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
# Now measure its performance with the test data
clf.score(X_test, y_test)   
0.96666666666666667

The model performs well, with about 97% accuracy. However, the dataset is small, which raises the risk of overfitting, especially considering that we are using only 60% of the flowers for training and just 40% for testing. We could still be overfitting to the specific train/test split we made.

K-Fold Cross-Validation

To be more confident about the model's performance, we can use K-fold cross-validation. Using K = 5, we get 5 different train/test splits, with a different fifth of the data reserved for testing each time:

# We give cross_val_score a model, the entire data set and its "real" values, and the number of folds:
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
# Print the accuracy for each fold:
print(scores)
# And the mean accuracy of all 5 folds:
print(scores.mean())
[ 0.96666667  1.          0.96666667  0.96666667  1.        ]
0.98

print(scores) gives us back the accuracy score from each of those iterations, that is, each one of those folds. We average those together to get an overall accuracy metric. When we do this over 5 folds, the model performs even better, averaging 98% accuracy.
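One detail worth knowing: when you pass cv=5 with a classifier, scikit-learn uses stratified folds, keeping the proportion of each species roughly the same in every fold. If you want that choice to be explicit (or want to change it), you can pass a splitter object instead of an integer. A minimal sketch, assuming the clf and iris objects from above:

from sklearn.model_selection import StratifiedKFold

# Explicit equivalent of cv=5 for a classifier: 5 stratified, non-shuffled folds
skf = StratifiedKFold(n_splits=5)
scores = cross_val_score(clf, iris.data, iris.target, cv=skf)
print(scores.mean())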

Comparing Kernels

Is a more complex polynomial kernel better than a simple linear one? Would it overfit, or would it better fit the data we have? That ultimately depends on whether there's actually an underlying linear or polynomial relationship between the petal and sepal measurements and the actual species:

# Try a more complex polynomial kernel (cross_val_score refits the model on each fold)
clf = svm.SVC(kernel='poly', C=1).fit(X_train, y_train)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
# Print the accuracy for each fold and the mean across folds:
print(scores)
print(scores.mean())
[ 1.          1.          0.9         0.93333333  1.        ]
0.966666666667

The more complex polynomial kernel produced lower accuracy than the simple linear kernel; the polynomial kernel is overfitting. But we couldn't have told that with a single train/test split:

# Build an SVC model for predicting iris classifications using training data
clf = svm.SVC(kernel='poly', C=1).fit(X_train, y_train)
# Now measure its performance with the test data
clf.score(X_test, y_test)  
0.96666666666666667

That's the same score we got with a single train/test split on the linear kernel.
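The same approach extends to tuning hyperparameters, such as the degree of the polynomial kernel. A minimal sketch, where the candidate degrees are arbitrary illustrative choices:

# Use cross-validation to compare polynomial degrees (SVC's default degree is 3)
for degree in (2, 3, 4):
    clf = svm.SVC(kernel='poly', degree=degree, C=1)
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
    print(degree, scores.mean())

Whichever setting wins, the decision is based on performance averaged across all folds rather than on one arbitrary split.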

Conclusion

If we had relied solely on a single train/test split, we could've missed signs of overfitting. K-fold cross-validation is a robust tool for model validation and should be part of your machine learning toolkit.