Sequence to Sequence Models
Sequence to sequence models, often referred to as Many to Many models, are crucial in driving progress in areas like machine translation, speech recognition, and image captioning.
Basic Model Architecture
Consider a machine translation problem where $x^{\langle 1 \rangle}, \dots, x^{\langle T_x \rangle}$ represents a sequence in French and $y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle}$ denotes its English translation.
The foundational model architecture consists of two key components:
- Encoder: This could be an RNN, LSTM, or GRU, designed to process the input sequence (e.g., a French sentence) and compress its information into a context vector.
- Decoder: Also an RNN, LSTM, or GRU, which takes the context vector and generates the corresponding output sequence (e.g., the translated English sentence).
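To make the two components concrete, here is a minimal sketch of a GRU-based encoder-decoder in PyTorch. The framework choice, class names, and layer sizes are illustrative assumptions, not taken from these notes or the papers below; the decoder is run one step at a time, feeding back its previous output word.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sentence and compresses it into a context vector."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                 # src: (batch, T_x) token ids
        _, h = self.rnn(self.embed(src))    # h: (1, batch, hidden_dim)
        return h                            # the context vector

class Decoder(nn.Module):
    """Generates the target sentence one word at a time from the context vector."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, h):        # prev_word: (batch, 1) token ids
        output, h = self.rnn(self.embed(prev_word), h)
        return self.out(output), h          # logits over the next word, new hidden state
```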
These ideas are from the following papers:
- Sutskever et al., 2014. Sequence to sequence learning with neural networks
- Cho et al., 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation
For image captioning, where $x$ is an image and $y$ is a caption, the model employs a pretrained CNN (like AlexNet) as the encoder and an RNN as the decoder.
Rather than randomly sampling the output word by word, we want the single most likely sentence as a whole. The vocabulary is far too large to enumerate every possible sentence, so we use a search algorithm for this. These ideas are from the following papers (they share similar ideas):
- Mao et al., 2014. Deep captioning with multimodal recurrent neural networks
- Vinyals et al., 2014. Show and tell: Neural image caption generator
- Karpathy and Li, 2015. Deep visual-semantic alignments for generating image descriptions
Selecting the Most Likely Sentence
The language model we have been looking at is very similar to the decoder part of the machine translation model, except for $a^{\langle 0 \rangle}$. In the language model, we always start with $a^{\langle 0 \rangle} = \vec{0}$. In the machine translation model, we start with $a^{\langle 0 \rangle}$ being a representation of the input sentence (the encoder's output).
The problem formulations are also different:
- Language model: $P(y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle})$
- Machine translation: $P(y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle} \mid x^{\langle 1 \rangle}, \dots, x^{\langle T_x \rangle})$
In other words, machine translation is a conditional language model.
In machine translation, the goal isn't to randomly sample output sentences, since sampling can give a different output each time. For example, given the French sentence "Jane visite l'Afrique en septembre," the translation could vary:
- "Jane is visiting Africa in September."
- "Jane is going to be visiting Africa in September."
- "In September, Jane will visit Africa."
The objective is to select the single best possible translation. Essentially, we want to find:

$$\arg\max_{y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle}} P(y^{\langle 1 \rangle}, \dots, y^{\langle T_y \rangle} \mid x)$$

where $x$ is the sentence in French. The most common algorithm for this is beam search. Why not use greedy search, picking the single best word at each step? It turns out that this approach doesn't really work! Let's explain it with an example.
Say the best output for the example we talked about is "Jane is visiting Africa in September."
Suppose that with the greedy approach the first two chosen words are "Jane is". The word most likely to come next is "going", because "going" is a very common word after "is", so the result may end up as "Jane is going to be visiting Africa in September", which is not the best/optimal translation.
So, a better approach than greedy search is an approximate search: an algorithm that tries (it won't always succeed) to pick the sentence $y$ that maximizes the conditional probability above.
Beam Search
Beam search, a heuristic search algorithm, is widely used for finding the best output sequences. At each step it keeps a fixed number of the most likely partial sequences, controlled by a parameter known as the beam width ($B$). For instance, with $B = 3$, the algorithm considers the top three candidate words or sequences at each step, ensuring that only the most likely paths are pursued.
For example, in the first step you may get ["in", "jane", "september"] - these are the best candidates for the first word.
Then, for each candidate first word, we consider every possible second word and keep the top $B$ combinations, where the best combinations are those with the highest value of the product

$$P(y^{\langle 1 \rangle} \mid x) \, P(y^{\langle 2 \rangle} \mid x, y^{\langle 1 \rangle}) = P(y^{\langle 1 \rangle}, y^{\langle 2 \rangle} \mid x)$$

Because we care about the pair that is jointly most likely, we multiply the probabilities. We may then have ["in september", "jane is", "jane visit"]. Notice that "september" is automatically discarded as a first word, since we keep only $B = 3$ candidates.
We repeat the same process, finding the best next words after ["september", "is", "visit"], and so on. In this algorithm, we keep only $B$ copies of the network in memory at any time. Note also that if $B = 1$, beam search becomes the same as greedy search.
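The following is a minimal sketch of this loop. It assumes a hypothetical helper `next_word_log_probs(x, prefix)` that stands in for running the decoder one step and returns $\log P(\text{word} \mid x, \text{prefix})$ for every word in the vocabulary; this helper is not from the notes.

```python
def beam_search(next_word_log_probs, x, B=3, max_len=30, eos="<EOS>"):
    # Each beam is (partial sentence, sum of log probabilities so far).
    beams = [([], 0.0)]
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            # Expand every beam with every possible next word.
            for word, log_p in next_word_log_probs(x, prefix).items():
                candidates.append((prefix + [word], score + log_p))
        # Keep only the B most likely partial sentences (B copies of the network).
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:B]:
            if prefix[-1] == eos:
                completed.append((prefix, score))   # finished sentence
            else:
                beams.append((prefix, score))
        if not beams:
            break
    return completed + beams
```

With `B=1` this reduces to greedy search, as noted above.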
Refinements to Beam Search
The first refinement is length normalization. To counter the tendency to produce overly short translations, length normalization is applied. This involves summing the logarithms of probabilities for numerical stability and dividing by the sequence length raised to a power $\alpha$, a hyperparameter typically set around $0.7$.
To illustrate, in beam search we are trying to maximize:

$$\arg\max_{y} \prod_{t=1}^{T_y} P(y^{\langle t \rangle} \mid x, y^{\langle 1 \rangle}, \dots, y^{\langle t-1 \rangle})$$

To do this, we multiply:

$$P(y^{\langle 1 \rangle} \mid x) \, P(y^{\langle 2 \rangle} \mid x, y^{\langle 1 \rangle}) \cdots P(y^{\langle T_y \rangle} \mid x, y^{\langle 1 \rangle}, \dots, y^{\langle T_y - 1 \rangle})$$
In machine translation, if we carry out beam search without length normalization, the algorithm will tend to output overly short translations. Each probability is a fraction, and usually a small one; multiplying many small fractions causes numerical underflow, meaning the result is too small for the floating-point representation in your computer to store accurately. So, in practice, we sum the logs of the probabilities instead of multiplying them directly:

$$\arg\max_{y} \sum_{t=1}^{T_y} \log P(y^{\langle t \rangle} \mid x, y^{\langle 1 \rangle}, \dots, y^{\langle t-1 \rangle})$$
By taking logs, you end up with a more numerically stable algorithm that is less prone to rounding errors (numerical underflow). Because $\log$ is a strictly monotonically increasing function, maximizing the sum of logs selects the same sentence as maximizing the product of probabilities.
But there's another problem. Both objectives above prefer short sequences over long ones, because multiplying more fractions (or adding more negative log terms) gives a smaller value. So there's another step: dividing by the number of words in the sequence.
We can normalize by the number of words in the translation, which significantly reduces the penalty for outputting longer translations:

$$\frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P(y^{\langle t \rangle} \mid x, y^{\langle 1 \rangle}, \dots, y^{\langle t-1 \rangle})$$

Here:
- $T_y$ is the length of the translation
- $\alpha$ is a hyperparameter to tune, typically set to $0.7$
- If $\alpha = 0$: no sequence length normalization
- If $\alpha = 1$: full sequence length normalization
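As a small sketch (the function names are illustrative, not from the notes), the normalized objective and the final selection over beam-search candidates could look like:

```python
def normalized_score(log_probs, alpha=0.7):
    # log_probs: [log P(y_t | x, y_1..y_{t-1})] for one candidate translation.
    Ty = len(log_probs)
    return sum(log_probs) / (Ty ** alpha)   # alpha=0: no normalization, alpha=1: full

def pick_best(candidates, alpha=0.7):
    # candidates: list of (sentence, log_probs) pairs produced by beam search.
    return max(candidates, key=lambda c: normalized_score(c[1], alpha))
```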
The second refinement is choosing the beam width ($B$). A larger $B$ increases the number of possibilities explored, potentially improving results at the cost of higher computational expense. Commonly, $B$ is set to 10 in production settings, though larger values may be used in research.
Unlike exact search algorithms like BFS (Breadth First Search) or DFS (Depth First Search), Beam Search runs faster but is not guaranteed to find the exact solution.
As you run beam search, you will see $B$ candidate sentences of different lengths (say, up to length 30: 1, 2, 3, 4, ..., 30). You then score each candidate with the objective function above and pick the one that achieves the highest value on this normalized log probability objective, i.e., the normalized log likelihood objective.
Error Analysis in Beam Search
The concepts of error analysis can also help us determine whether inaccuracies stem from the beam search or from the RNN model. By comparing the probabilities of the ideal translation ($y^*$) and the model-generated translation ($\hat{y}$), one can identify which component is underperforming.
Suppose that we have the following:
- Initial info:
  - $x$ = "Jane visite l'Afrique en septembre."
  - $y^*$ = "Jane visits Africa in September."
    - The ideal answer, generated by a human.
  - $\hat{y}$ = "Jane visited Africa last September."
    - The answer produced by our model; a bad result.
We now want to know who to blame - the RNN or the beam search. To do that, we compute $P(y^* \mid x)$ and $P(\hat{y} \mid x)$. There are two cases:
- Case 1: $P(y^* \mid x) > P(\hat{y} \mid x)$
  - Conclusion: beam search is at fault.
  - Beam search chose $\hat{y}$ even though $y^*$ has a higher probability.
- Case 2: $P(y^* \mid x) \leq P(\hat{y} \mid x)$
  - Conclusion: the RNN model is at fault.
  - $y^*$ is a better translation, but the RNN predicts $P(y^* \mid x) \leq P(\hat{y} \mid x)$.
Do this for a number of examples and count how many times the beam search is at fault and how many times the RNN model is at fault. This will help you decide what to work on next, since you can figure out what fraction of errors is attributable to the search algorithm rather than to the RNN model.
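A sketch of this counting procedure, assuming a hypothetical helper `log_prob(x, y)` that returns the model's $\log P(y \mid x)$ (comparing log probabilities is equivalent to comparing probabilities):

```python
def attribute_errors(examples, log_prob):
    # examples: list of (x, y_star, y_hat) triples from the dev set.
    beam_search_faults, rnn_faults = 0, 0
    for x, y_star, y_hat in examples:
        if log_prob(x, y_star) > log_prob(x, y_hat):
            beam_search_faults += 1   # Case 1: search missed a higher-probability sentence
        else:
            rnn_faults += 1           # Case 2: the model prefers the worse translation
    return beam_search_faults, rnn_faults
```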
BLEU Score (Bilingual Evaluation Understudy)
The BLEU (Bilingual Evaluation Understudy) score is a metric used to evaluate machine translations. Given a sentence in one language, there are one or more possible good translations in another language. BLEU compares the machine-generated translation to one or more human reference translations, using a modified precision measure that accounts for the frequency of words and $n$-grams in the reference texts.
The BLEU score incorporates a brevity penalty ($BP$) to address the tendency to favor shorter outputs. For a comprehensive evaluation, BLEU combines scores across different $n$-grams (unigrams, bigrams, etc.) into a single average.
The intuition is: as long as the machine-generated translation is pretty close to any of the references provided by humans, then it will get a high BLEU score.
Let's take an example:
- = "Le chat est sur le tapis."
- = "The cat is on the mat." (Human reference 1)
- = "There is a cat on the mat." (Human reference 2)
Suppose that the machine outputs: "the the the the the the the." One way to evaluate the machine output is to look at each word in the output and check whether it appears in the references. The precision would be $7/7$ because "the" appears in $y^{*1}$ and $y^{*2}$. This is not a useful measure! Basically, "what fraction of words in the output also appear in the references" is not a good scoring measure.
Instead, we use a modified precision measure in which each word gets credit only up to the maximum number of times it appears in any single reference sentence. We find the reference with the maximum count of a particular word and clip the credit for that word to this count. So, the modified precision is $2/7$, because the maximum count of "the" in a single reference is $2$ (in $y^{*1}$); we clip the count of $7$ down to that maximum of $2$. Here we are looking at one word at a time - unigrams; we can look at $n$-grams too.
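A tiny sketch of this clipping step for the example above, assuming lowercased, whitespace-tokenized sentences:

```python
from collections import Counter

output = ["the"] * 7
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
clip = max(Counter(ref)["the"] for ref in refs)       # "the" appears at most 2 times in a single reference
print(min(output.count("the"), clip) / len(output))   # modified precision = 2/7 ~ 0.29
```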
BLEU Score on Bigrams
The $n$-grams are typically collected from a text or speech corpus. When the items are words, $n$-grams may also be called shingles. An $n$-gram of size 1 is referred to as a unigram; size 2 is a bigram (or, less commonly, a digram); size 3 is a trigram; and so on.
- = "Le chat est sur le tapis."
- = "The cat is on the mat." (Human reference 1)
- = "There is a cat on the mat." (Human reference 2)
Suppose that the machine outputs: "the cat the cat on the mat." The bigrams in the machine output are:
| Pairs   | Count | Count clip |
|---------|-------|------------|
| the cat | 2     | 1 (in $y^{*1}$) |
| cat the | 1     | 0 |
| cat on  | 1     | 1 (in $y^{*2}$) |
| on the  | 1     | 1 (in $y^{*1}$ and $y^{*2}$) |
| the mat | 1     | 1 (in $y^{*1}$ and $y^{*2}$) |
| Totals  | 6     | 4 |

The modified bigram precision is therefore $4/6$.
The modified precision for unigrams is:

$$p_1 = \frac{\sum_{\text{unigram} \in \hat{y}} \text{count}_{\text{clip}}(\text{unigram})}{\sum_{\text{unigram} \in \hat{y}} \text{count}(\text{unigram})}$$

And for $n$-grams:

$$p_n = \frac{\sum_{n\text{-gram} \in \hat{y}} \text{count}_{\text{clip}}(n\text{-gram})}{\sum_{n\text{-gram} \in \hat{y}} \text{count}(n\text{-gram})}$$
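A sketch of the modified $n$-gram precision $p_n$, again assuming lowercased, whitespace-tokenized sentences (function names are illustrative):

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(output, references, n):
    out_counts = Counter(ngrams(output, n))
    ref_counts = [Counter(ngrams(ref, n)) for ref in references]
    # Clip each n-gram count to its maximum count in any single reference.
    clipped = sum(min(c, max(rc[g] for rc in ref_counts)) for g, c in out_counts.items())
    return clipped / sum(out_counts.values())

refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_precision("the cat the cat on the mat".split(), refs, 2))  # 4/6 ~ 0.67
```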
Let's put this together to formalize the BLEU score:

$$\text{BLEU} = BP \cdot \exp\left(\frac{1}{4} \sum_{n=1}^{4} \log p_n\right)$$

For example, for BLEU with up to 4-grams, we compute $p_1$, $p_2$, $p_3$, and $p_4$, average their logarithms, and take the exponential. Here $BP$ is the brevity penalty and $p_n$ is the modified precision for $n$-grams. The brevity penalty penalizes the machine translation model for outputting a translation that is too short; it is a number between $0$ and $1$: $BP = 1$ if the machine output is longer than the reference output, and $BP = \exp(1 - \text{reference length} / \text{output length})$ otherwise.
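Putting the pieces together, a sketch of the combined score, reusing `modified_precision` from the snippet above; for simplicity it uses the shortest reference length for the brevity penalty (the original BLEU definition uses the closest-length reference) and does not handle the case where some $p_n$ is zero:

```python
import math

def bleu(output, references, max_n=4):
    # Geometric mean of the modified precisions p_1..p_4, scaled by the brevity penalty.
    precisions = [modified_precision(output, references, n) for n in range(1, max_n + 1)]
    log_avg = sum(math.log(p) for p in precisions) / max_n   # undefined if any p_n == 0
    ref_len = min(len(r) for r in references)                # shortest reference (simplifying assumption)
    bp = 1.0 if len(output) > ref_len else math.exp(1 - ref_len / len(output))
    return bp * math.exp(log_avg)
```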
The BLEU score has several open-source implementations. It is used in a variety of systems like machine translation and image captioning.
- Papineni et al., 2002. BLEU: a method for automatic evaluation of machine translation