Artificial Intelligence 🤖
Practical Aspects of Deep Learning
Dataset Splits

Train, Dev, and Test Sets in Neural Network Development

In deep learning, we use training, development (dev), and test sets to build and evaluate our models. The general workflow is:

  1. Define the model structure (such as the number of input features and outputs)
  2. Initialize the model's parameters
  3. Loop:
    • Calculate current loss (forward propagation)
    • Calculate current gradient (backward propagation)
    • Update parameters (gradient descent)
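The loop above can be sketched as a minimal logistic-regression trainer. The toy data, learning rate, and iteration count here are illustrative assumptions, not values from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: 3 input features, 200 examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 200))                          # shape (n_features, m)
Y = (X.sum(axis=0, keepdims=True) > 0).astype(float)   # shape (1, m)

# 1. Define the structure / 2. Initialize the parameters.
w = np.zeros((3, 1))
b = 0.0
lr = 0.1                    # assumed learning rate
m = X.shape[1]

# 3. Loop: forward prop (loss), backward prop (gradient), update.
for _ in range(1000):
    A = sigmoid(w.T @ X + b)                                 # forward propagation
    loss = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    dZ = A - Y                                               # backward propagation
    dw = (X @ dZ.T) / m
    db = dZ.mean()
    w -= lr * dw                                             # gradient descent update
    b -= lr * db
```

Each pass computes the current loss, the gradients of the loss with respect to `w` and `b`, and then takes one gradient-descent step.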

It's impossible to get all your hyperparameters right for a new application on the first attempt, so you iterate through a loop of Idea, Code, and Experiment. Your data will be split into three parts:

  1. Training Set: This is where the model learns. It's the largest data set and is used to fit the model's parameters.
  2. Dev Set: Also known as the Hold-out cross validation set, it's used to fine-tune hyperparameters and compare different models to select the best performer.
    • Choose dev and test sets that reflect the data you expect to get in the future in the real world, and consider what is important to do well on.
  3. Test Set: After choosing the best model using the dev set, the test set provides an unbiased evaluation of the model's performance.
    • This set must come from the same distribution as the dev set.
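A common way to produce these three splits is to shuffle once and slice, so the dev and test sets are drawn from the same pool. The dataset size and 60/20/20 ratios below are assumptions for illustration:

```python
import numpy as np

# Hypothetical dataset of 10,000 examples; sizes and ratios are assumptions.
rng = np.random.default_rng(42)
n = 10_000
indices = rng.permutation(n)   # shuffle once so all splits share a distribution

n_train = int(0.6 * n)
n_dev = int(0.2 * n)

train_idx = indices[:n_train]
dev_idx = indices[n_train:n_train + n_dev]
test_idx = indices[n_train + n_dev:]   # dev and test come from the same shuffled pool
```

Because both `dev_idx` and `test_idx` are slices of the same shuffled index array, the dev and test sets automatically come from the same distribution.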

You will try to build a model upon the training set. Use the Dev (Hold-out cross validation set) to see which of many different models performs the best. When you have a final model that you want to evaluate, you can take the best model you have found and evaluate it on your test set. You do this in order to get an unbiased estimate of how well your algorithm is doing.

Consistency is key: the dev and test sets should come from the same data distribution so that the model's performance is assessed accurately. For example, if the training set consists of cat pictures from the web while the dev/test pictures come from users' phones, the distributions will mismatch. Also, if a team uses only a train/dev split with no separate test set, the held-out set should still be called a dev set, since it is part of the model-development process.

Setting up the dev set, as well as the evaluation metric, is really defining what target you want to aim at. There are scenarios where you might need to revise your dev or test sets, as well as the evaluation metric. In general, if doing well on your metrics with the current dev/test sets does not correspond to doing well on your application, change your metric and/or dev/test sets.

You may also see a train-dev set, which is a random subset of the training set, so it has the same distribution. This is used to investigate data mismatch problems.
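A train-dev set can be carved out of the training set itself, so it shares the training distribution exactly. The sizes and names below are hypothetical:

```python
import numpy as np

# Hypothetical: hold out a train-dev subset from an assumed training set.
rng = np.random.default_rng(0)
train_idx = np.arange(50_000)          # assumed training-set indices
train_dev_size = 2_000                 # assumed hold-out size

shuffled = rng.permutation(train_idx)
train_dev_idx = shuffled[:train_dev_size]        # same distribution as training data
remaining_train_idx = shuffled[train_dev_size:]  # used for actual training
```

If error on the train-dev set is low but error on the dev set is much higher, the gap points to a data-mismatch problem rather than overfitting.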

Sizing Dev & Test Sets

The goal of the dev set is to compare different algorithms and see which works better, so it just needs to be big enough for you to evaluate two or ten different algorithm choices and quickly decide which is doing better. You might not even need a whole 20% of your data for that.

The size of each data set can vary:

  • For smaller datasets (roughly 100 to 1,000,000 examples), a 60/20/20 split between train/dev/test is common.
  • For larger datasets (over 1,000,000 examples), more data is usually given to the training set, sometimes as much as 98%, leaving 1% each for dev and test.
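A small helper makes the contrast between the two regimes concrete. The fractions follow the 60/20/20 and 98/1/1 splits above; the function name is just an illustration:

```python
def split_sizes(n, train_frac, dev_frac):
    """Return (train, dev, test) example counts for the given fractions."""
    n_train = int(n * train_frac)
    n_dev = int(n * dev_frac)
    return n_train, n_dev, n - n_train - n_dev

# Small dataset: classic 60/20/20 split.
small = split_sizes(10_000, 0.60, 0.20)      # → (6000, 2000, 2000)

# Large dataset: 98/1/1 split. Dev and test shrink as fractions,
# but stay large enough in absolute terms to compare models reliably.
large = split_sizes(1_000_000, 0.98, 0.01)   # → (980000, 10000, 10000)
```

At a million examples, even a 1% dev set still contains 10,000 examples, which is plenty to tell two algorithms apart.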