
Pre-Training Large Language Models

Choosing a Model

After figuring out the scope of our application, the next step is to select the model we will work with.

We have two options:

  1. Choose a pre-trained foundation model.
  2. Train our own model to create a custom LLM.

There are specific use cases where the second option might make more sense, but in general, we will develop our application using a pre-trained foundation model.

Model Hubs

There are many open-source and paid models available that we can use for our application. Many of the developers of these models have made available "hubs" where we can browse and test the models.

One of the most useful features of these hubs is the inclusion of model cards. Model cards describe important details of a model, such as its best use cases, how it was trained and its limitations.

Example: the model card for LLaMA.
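
To get a feel for how these hubs are used in practice, here is a minimal sketch of pulling a pre-trained model and its tokenizer from the Hugging Face Hub. The checkpoint name is only an example; in practice we would pick a model based on its model card.

```python
# Minimal sketch: loading a pre-trained foundation model from a model hub.
# "gpt2" is only an example checkpoint; swap in the model chosen from the hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

print(model.config.model_type, model.num_parameters())
```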

Training Large Language Models

Variants of the Transformer model are suited to different tasks. The differences between these variants can be understood by looking at how they are trained. This, in turn, can help us make an informed decision about the model we want to use for our application and helps us better navigate model hubs.

Initial Training Process (Pre-training)

The initial training process of an LLM is called pre-training. LLMs work by learning a deep statistical representation of language, and this representation is developed during pre-training.

At a high-level, during pre-training, the model is fed large amounts of unstructured textual data, ranging from gigabytes to petabytes in size. The data is pulled from many sources such as web crawling and corpora of text compiled specifically to train LLMs.

The pre-training process is self-supervised. The model internalizes the patterns and structures present in the language, and these patterns enable it to complete its training objective, which depends on the architecture of the model. In other words, during pre-training, the model weights are updated to minimize the loss of the training objective.

Clearly, this step requires a lot of compute and the use of GPUs.

Additionally, since the data is coming from public sources such as the internet, there is often a data quality filter applied before feeding the data to the LLM so that the training data is of high quality, has low bias and does not have harmful content. Due to this, only about 1-3% of the original tokens are used for pre-training.

Training Objectives for Transformer Variants

The three configurations of a Transformer are trained with different training objectives and thus, learn to perform different tasks.

Encoder-only Models (Autoencoding Models)

The encoder-only variants of Transformers are also called autoencoding models.

They are pre-trained using Masked Language Modeling (MLM). In MLM, tokens in the input sequence are randomly masked and the training objective is to predict the masked tokens in order to reconstruct the original input sequence. This is also called a denoising objective since the masking of the tokens can be thought of as adding noise to the input sequence and then predicting the masked tokens can be thought of as removing that noise from the input sequence.

Autoencoding models build bidirectional context representations of the input sequence, meaning that the model has an understanding of the full context of a token rather than just the tokens that come before it.
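
As an illustration, here is a minimal sketch of the MLM objective in action, using an off-the-shelf autoencoding model via the Hugging Face fill-mask pipeline. The checkpoint name is just an example.

```python
# Minimal sketch of the MLM (denoising) objective with a pre-trained autoencoding model.
# "bert-base-uncased" is an example checkpoint; any BERT-style model behaves similarly.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model uses bidirectional context (tokens before AND after the mask)
# to reconstruct the original sequence.
for prediction in fill_mask("The teacher [MASK] the student."):
    print(prediction["token_str"], round(prediction["score"], 3))
```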

[Figure: Masked language modeling]

These models are usually suited to tasks that benefit from this bidirectional context, such as sentiment analysis, named entity recognition and word classification.

Examples: BERT, RoBERTa.

Decoder-only Models (Autoregressive Models)

The decoder-only variants of Transformers are also called autoregressive models.

They are pre-trained using Causal Language Modeling (CLM). In CLM, the training objective is to predict the next token based on the previous sequence of tokens. The tokens after the current position are masked, so the model can only see the input tokens leading up to the token being predicted; it has no knowledge of the tokens that come after it. The model then iterates over the input sequence, predicting the next token one position at a time. Thus, in contrast to autoencoding models, the model builds a unidirectional context for each token.
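
Here is a rough sketch of what this objective looks like with a library such as Hugging Face transformers (the checkpoint name is only an example): the model scores the next token given the preceding ones, and passing the inputs as labels yields the causal language modeling loss.

```python
# Minimal sketch of the causal language modeling objective: predict the next token
# given only the tokens that come before it. "gpt2" is an example checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The teacher teaches the", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape: (batch, seq_len, vocab_size)

next_token_id = logits[0, -1].argmax()         # most likely next token
print(tokenizer.decode(next_token_id))

# During pre-training, the loss is the cross-entropy of these next-token predictions;
# passing labels=input_ids makes the model compute it directly.
with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss
print(loss.item())
```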

[Figure: Causal language modeling]

By learning to predict the next token from a vast number of examples, the model builds a statistical representation of the language. Predicting the next token is sometimes called full language modeling by researchers.

These models are most suitable for text generation but large autoregressive models also show strong zero-shot inference ability and can perform a variety of tasks.

Examples: GPT, BLOOM.

Encoder-Decoder Models (Sequence-to-Sequence Models)

The encoder-decoder variants of Transformers are also called sequence-to-sequence models.

The exact details of the pre-training objective vary from model to model. For example, FLAN-T5 is trained using span corruption. In span corruption, a part of the input sequence is masked and replaced by a sentinel token. These sentinel tokens are special tokens added to the vocabulary that do not correspond to any actual word from the dataset. The decoder then has to reconstruct the sentence autoregressively. The output is the sentinel token followed by the predicted tokens.
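
As a rough sketch of the input/target format described above, T5-style models use `<extra_id_0>`, `<extra_id_1>`, ... as their sentinel tokens; the checkpoint name below is just an example.

```python
# Minimal sketch of span corruption with T5-style sentinel tokens.
# "t5-small" is an example checkpoint.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# A masked span is replaced by a sentinel token (<extra_id_0>, <extra_id_1>, ...),
# which does not correspond to any real word in the dataset.
inputs = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt")
# The target is each sentinel token followed by the tokens it replaced.
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>",
                   return_tensors="pt").input_ids

loss = model(input_ids=inputs.input_ids, labels=labels).loss
print(loss.item())
```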

[Figure: Span corruption]

We can use such models for tasks such as translation, summarization and question answering. They are most useful when both the input and the output are bodies of text.

Examples: FLAN-T5, BART.

Computational Challenges in Training LLMs

LLMs have a lot of parameters and thus, require vast compute resources for training. One common issue is memory. Many models are too big to be loaded into a single GPU's memory.

πŸ’‘

If you've ever tried training or even just loading a model on NVIDIA GPUs, this error message might look familiar:

OutOfMemoryError: CUDA out of memory.

CUDA, short for Compute Unified Device Architecture, is a collection of libraries and tools developed for NVIDIA GPUs.

Consider an LLM with 1 billion parameters. The parameters are stored as floats; at single precision, each parameter is represented by a 32-bit float, which occupies 4 bytes of memory. 1 billion parameters would therefore require $4 \times 10^9$ bytes, or ~4 GB. Thus, we need a GPU with at least 4 GB of VRAM just to load the model.

Besides this, if you want to train the model, you'll have to plan for additional components that use GPU memory during training. These include two Adam optimizer states, gradients, activations, and temporary variables needed by your functions. This can easily lead to 20 extra bytes of memory per model parameter.

Overall, we will need about 6x the memory that the model weights alone take up. To train a one billion parameter model at 32-bit full precision, you'll need approximately 24 GB of GPU RAM. This is definitely too large for consumer hardware.
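
The arithmetic behind these numbers is simple enough to sanity-check. The ~20 extra bytes per parameter is the rough rule of thumb used above, not an exact figure:

```python
# Back-of-the-envelope GPU memory estimate for the numbers above.
params = 1_000_000_000                 # 1 billion parameters
bytes_per_param_fp32 = 4               # 32-bit float = 4 bytes
extra_bytes_per_param = 20             # optimizer states, gradients, activations, temporaries (rough)

weights_gb = params * bytes_per_param_fp32 / 1e9
training_gb = params * (bytes_per_param_fp32 + extra_bytes_per_param) / 1e9

print(f"Loading the weights alone: ~{weights_gb:.0f} GB")
print(f"Training at 32-bit full precision: ~{training_gb:.0f} GB")   # ~6x the weights
```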

Quantization

We can reduce the amount of memory required to store and train the model by reducing the precision of the model weights from 32-bit floating point numbers (FP32) to 16-bit floating point numbers (FP16) or 8-bit integers (INT8).

FP32 can represent values with magnitudes ranging from about $3 \times 10^{-38}$ to about $3 \times 10^{38}$, and it is the default representation for model weights.

The corresponding data types used in deep learning frameworks and libraries are:

  • FP32: 32-bit full precision
  • FP16 or BFLOAT16: 16-bit half precision
  • INT8: 8-bit integers

Quantization statistically projects the 32-bit numbers into a lower precision space, using scaling factors calculated based on the range of the original 32-bit floating point numbers.

FP16 and BF16

For example, consider that we want to store $\pi$ to 6 decimal places, that is, we want to store $\pi = 3.141592$.

In 32-bit representation, this is stored as:

$0 \; 10000000 \; 10010010000111111011000$

The first bit is the sign bit, the next 8 bits represent the exponent, and the final 23 bits represent the fraction (also called the mantissa or significand).

The last 23 bits determine the precision of the representation. If we convert this back to decimal and compare it to the real value of $\pi$, we'll see that the converted number itself has lost precision. But the number is accurate to the 6 decimal places we require.

The same number is projected into 16-bit representation as:

$0 \; 10000 \; 1001001000$

As in 32-bit, there is a sign bit, but there are only 5 bits for the exponent and 10 bits for the fraction. This makes the range of representable values much smaller (~$-65504$ to ~$65504$). When we convert this back, we lose even more precision: the result is $3.140625$.

You'll find that this loss in precision is acceptable in most cases because you're trying to optimize for memory footprint. Storing a value in FP32 requires 4 bytes of memory, whereas storing a value in FP16 requires only 2 bytes of memory, so with quantization you have reduced the memory requirement by half.

Another popular alternative in the AI field is BFLOAT16 or BF16, short for "Brain Floating Point Format", developed by Google Brain. It has 1 sign bit, 8 exponent bits and 7 fraction bits, like so:

$0 \; 10000000 \; 1001001$

It's a hybrid between FP16 and FP32. It helps with training stability and is supported by NVIDIA GPUs like the A100. BF16 is also called truncated FP32, since it captures the dynamic range of FP32 while still using only 16 bits: it keeps the full 8 exponent bits of FP32 but truncates the fraction to 7 bits. This saves memory and also increases model performance by speeding up calculations. The disadvantage is that BF16 is not well suited for integer calculations, which are relatively rare in deep learning anyway.
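
We can observe this precision/memory trade-off directly with PyTorch dtypes; this is a quick illustrative check, not part of any particular training recipe:

```python
# Comparing FP32, FP16 and BF16 representations of pi and their memory cost.
import math
import torch

pi_fp32 = torch.tensor(math.pi, dtype=torch.float32)
pi_fp16 = pi_fp32.to(torch.float16)    # 5 exponent bits, 10 fraction bits
pi_bf16 = pi_fp32.to(torch.bfloat16)   # 8 exponent bits, 7 fraction bits

for name, value in [("FP32", pi_fp32), ("FP16", pi_fp16), ("BF16", pi_bf16)]:
    print(f"{name}: value = {value.item():.6f}, bytes per element = {value.element_size()}")
# Both 16-bit formats store pi as 3.140625 here, at half the memory of FP32.
```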

INT8

If we project the 32-bit $\pi$ representation to INT8, it will be stored as:

$0 \; 0000011$

With one bit used for the sign, 7 bits remain for the value, so the range of representable values is $-128$ to $127$, and $\pi$ is simply projected to $3$. The memory requirement is halved yet again: one byte per value, a quarter ($\frac{1}{4}$) of the original 32-bit requirement.

But nowadays, it's common for models to have 50 billion or over 100 billion parameters, requiring thousands of gigabytes of GPU memory. In such cases, we need to split the model across multiple GPUs for training using efficient multi-GPU compute strategies.

Comparison

In comparison, the memory requirements for storing one parameter in FP32, FP16, BFLOAT16 and INT8 are:

| Data Type | Bits | Exponent | Fraction | Memory needed to store one value |
| --- | --- | --- | --- | --- |
| FP32 | 32 | 8 | 23 | 4 bytes |
| FP16 | 16 | 5 | 10 | 2 bytes |
| BFLOAT16 | 16 | 8 | 7 | 2 bytes |
| INT8 | 8 | -- | 7 | 1 byte |

Remember that the goal of quantization is to reduce the memory required to store and train models by reducing the precision of the model weights. Quantization statistically projects the original 32-bit floating point numbers into lower precision spaces using scaling factors calculated based on the range of the original 32-bit floats. Modern deep learning frameworks and libraries also support quantization-aware training (QAT), which learns the quantization scaling factors during the training process.
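
To illustrate the idea of a scaling factor, here is a minimal sketch of a generic symmetric INT8 scheme; it is not the exact algorithm used by any specific library:

```python
# Minimal sketch of symmetric INT8 quantization with a scaling factor.
import torch

weights_fp32 = torch.randn(4, 4)               # example FP32 weights

scale = weights_fp32.abs().max() / 127         # scaling factor from the FP32 value range
weights_int8 = torch.clamp((weights_fp32 / scale).round(), -128, 127).to(torch.int8)

# De-quantize to see how much precision was lost.
weights_dequant = weights_int8.to(torch.float32) * scale
print("max quantization error:", (weights_fp32 - weights_dequant).abs().max().item())
print("bytes per value:", weights_fp32.element_size(), "->", weights_int8.element_size())
```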

Scaling Choices

The goal during pre-training is to maximize the model's performance on its learning objective, which is equivalent to minimizing the loss function. There are two choices for achieving better performance:

  • Increasing the dataset size in terms of number of tokens.
  • Increasing the model size in terms of number of parameters.

These are the scaling choices available to us. In theory, we can scale either or both the dataset size and the model size, but we are constrained by compute budget in terms of GPUs, training time, cost, etc.

Scaling Laws

Compute Metrics (Petaflops/s-day)

There are some popular units used to measure compute budget.

One of them is the petaflops/s-day (PF-days). It is the number of floating point operations performed at a rate of 1 petaflop (1 quadrillion floating point operations) per second for one day.

With respect to training transformers, 1 petaflop/s-day is equivalent to about 8 NVIDIA V100 GPUs or 2 NVIDIA A100 GPUs operating at full efficiency for 24 hours.
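
The arithmetic behind the unit is straightforward; the per-GPU throughput below is an assumed, approximate figure used only to illustrate the equivalence:

```python
# Rough arithmetic behind the petaflop/s-day unit.
PFLOP_PER_SECOND = 1e15                # 1 quadrillion floating point operations per second
SECONDS_PER_DAY = 24 * 60 * 60

total_flops = PFLOP_PER_SECOND * SECONDS_PER_DAY
print(f"1 petaflop/s-day = {total_flops:.2e} floating point operations")   # ~8.64e19

assumed_v100_flops = 125e12            # assumed ~125 TFLOP/s per V100 at full efficiency
print(f"GPUs needed to sustain 1 PFLOP/s: ~{PFLOP_PER_SECOND / assumed_v100_flops:.0f}")  # ~8
```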

The graph below shows the petaflop/s-day measurements for some popular models. BERT/RoBERTa are encoder-only models, T5 is an encoder-decoder model and GPT-3 is a decoder-only model. The $y$-axis is logarithmic, while the $x$-axis varies in terms of the number of parameters trained.

[Figure: Petaflop/s-days required to pre-train various models]

Compute Budget, Dataset Size, Model Size vs Model Performance

Researchers have explored the relationships between dataset size, model size and compute budget. In the paper Scaling Laws for Neural Language Models (OpenAI, 2020), we find the following figure:

[Figure: Power-law relationships between test loss and compute, dataset size, and model size]

The graph shows a clear relationship between model performance and each of the three factors, which can be approximated by a power-law relationship. That is, one is proportional to the other raised to some power. When plotted on a graph where both axes are logarithmic, such relationships appear as a straight line. The relationship only holds when the training is not bottlenecked by the other two factors.
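
Schematically (using generic symbols rather than the paper's exact notation), each of these relationships has the form

$$L(X) = \left(\frac{X_c}{X}\right)^{\alpha}$$

where $L$ is the test loss, $X$ is the compute budget, dataset size, or model size, and $X_c$ and $\alpha$ are constants fitted empirically. Taking logarithms of both sides gives $\log L = \alpha(\log X_c - \log X)$, which is exactly the straight line seen on a log-log plot.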

More often than not, compute budget is a hard constraint, determined by:

  • Hardware Availability
  • Project Timeline
  • Financial Budget

Thus, most of the time, we end up increasing the model size or the dataset size to increase performance.

Compute-Optimal Models

In the paper Training Compute-Optimal Large Language Models (DeepMind, 2022), popularly referred to as the Chinchilla paper, researchers tried to find the optimal number of parameters and volume of training data for a given compute budget. Such models are called compute-optimal models. The paper found the following:

  • Very large models may be over-parameterized and under-trained. They have more parameters than they need to achieve a good understanding of language and would benefit from seeing more training data.
  • Smaller models trained on more data could perform as well as large models.
  • Compute-optimal training dataset size is ~20 times the number of parameters (see the short calculation after this list).
  • There is a relationship between model size (in number of parameters) and the optimal number of tokens to train the model with; according to this relationship, many existing models may even be over-parameterized. For instance, while increasing the dataset size is helpful, if we do not jointly scale the model size, the model might not be able to capture the value of the larger dataset.
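
As a quick application of the ~20 tokens-per-parameter heuristic mentioned above (a rule of thumb from the Chinchilla findings, not an exact prescription):

```python
# Approximate compute-optimal training set size for a given model size,
# using the ~20 tokens-per-parameter heuristic from the Chinchilla findings.
def chinchilla_optimal_tokens(num_parameters: int, tokens_per_parameter: int = 20) -> int:
    return num_parameters * tokens_per_parameter

for params_in_billions in (1, 70, 175):
    tokens_in_billions = chinchilla_optimal_tokens(params_in_billions * 10**9) / 10**9
    print(f"{params_in_billions}B parameters -> ~{tokens_in_billions:,.0f}B training tokens")
# e.g. a 70B-parameter model calls for roughly 1,400B (1.4T) training tokens.
```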

The paper presented the Chinchilla model as a proof of their findings. The Chinchilla model uses the same compute budget as another model called Gopher, but has 4 times fewer parameters (70 billion vs 280 billion) and uses 4 times more training data. It consistently outperforms Gopher while being significantly smaller.

Due to the findings of the paper, research teams have started developing smaller models that achieve similar, if not better, performance than larger models trained in a non-optimal way. See the BloombergGPT case study for an example: it was trained in a roughly compute-optimal way following the Chinchilla scaling laws and so achieves good performance with a size of 50 billion parameters (for reference, PaLM has 540B).

Domain Adaptation

Using a pre-trained LLM can help us save time and get to a working solution much faster. However, there is a situation where it may be necessary to pre-train our own model. If the domain of the problem we are trying to solve uses vocabulary and language structures that are not commonly used in day-to-day language, we might need to train our own model. This is called domain adaptation.

Because models learn their vocabulary and understanding of language through the original pre-training task, pre-training your model from scratch will result in better models for highly specialized domains like law, medicine, finance or science.

Legal Domain

Legal language is often very different from day-to-day language and usually requires domain adaptation: for example, it makes heavy use of Latin terms. Moreover, it also uses everyday words in different contexts.

[Figure: Examples of legal language]

These words are rarely used outside of the legal world, which means that they are unlikely to have appeared widely in the training text of existing LLMs. As a result, the models may have difficulty understanding these terms or using them correctly.

Medical Domain

The medical domain also uses uncommon words to describe diseases, and it uses language in an idiosyncratic way, as shown in the image below. The line Sig: 1 tab po qid pc & hs does not make much sense to most of us, but it is shorthand used by doctors to write prescriptions, and it makes perfect sense to a pharmacist (take one tablet by mouth, four times a day, after meals and at bedtime).

[Figure: Example of medical prescription shorthand]

Finance Domain: BloombergGPT

BloombergGPT is a large decoder-only LLM developed by Bloomberg for the finance domain. It was trained on data consisting of both finance data (~51%) and general-purpose text data (~49%). The model achieves best-in-class performance on finance-related tasks while also maintaining competitive performance on general language tasks.

During the training of BloombergGPT, the authors used the Chinchilla scaling laws to guide the number of parameters in the model and the volume of training data, measured in tokens. The recommendations of Chinchilla are represented by the lines Chinchilla-1, Chinchilla-2 and Chinchilla-3 in the image below, and we can see that BloombergGPT is close to them.

[Figure: BloombergGPT relative to the Chinchilla scaling-law recommendations]

While the recommended configuration for the team’s available training compute budget was 50 billion parameters and 1.4 trillion tokens, acquiring 1.4 trillion tokens of training data in the finance domain proved challenging. Consequently, they constructed a dataset containing just 700 billion tokens, less than the compute-optimal value. Furthermore, due to early stopping, the training process terminated after processing 569 billion tokens.

The BloombergGPT project is a good illustration of pre-training a model for increased domain-specificity, and of the challenges that may force trade-offs against compute-optimal model and training configurations. You can read the BloombergGPT paper for more details.

BLOOM: BigScience 176B Model

BLOOM (BigScience Large Open-science Open-access Multilingual Language Model) is an open-source LLM with 176B parameters, trained in an open and transparent way. In the BLOOM paper, the authors present a detailed discussion of the dataset and process used to train the model. A high-level overview of the model is also available.

  1. The model:
    • 176B-parameter decoder-only architecture (GPT-like)
    • 70 layers, 112 attention heads per layer, a hidden dimensionality of 14336 and a sequence length of 2048 tokens
    • ALiBi positional embeddings and the GeLU activation function
  2. The dataset:
  3. The engineering side:
  4. Environmental considerations
    • Jean Zay (opens in a new tab), the supercomputer used for model training, is mostly powered by nuclear energy, which is a low carbon energy source.
    • Significant efforts were made to make sure that the computing infrastructure is as efficient as possible β€” the heat generated by the hardware even gets used for heating buildings on campus!
    • The team is also working on a precise estimate of the carbon emitted during all of the steps of model training, including intermediate experiments as well as inference.
