Sequence to Sequence Models

Sequence to sequence models, often referred to as Many to Many models, are crucial in driving progress in areas like machine translation, speech recognition, and image captioning.

Basic Model Architecture

Consider a machine translation problem where $X$ represents a sequence in French and $Y$ denotes its English translation.

Translation

The foundational model architecture consists of two key components:

  • Encoder: This could be an RNN, LSTM, or GRU, designed to process the input sequence (e.g., a French sentence) and compress its information into a context vector.

  • Decoder: Also an RNN, LSTM, or GRU, which takes the context vector and generates the corresponding output sequence (e.g., the translated English sentence).

These ideas are from the following papers:

  • Sutskever et al., 2014. Sequence to sequence learning with neural networks
  • Cho et al., 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation
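
To make the encoder-decoder picture concrete, here is a minimal sketch in PyTorch. It is only an illustration of the idea (the GRU cells, vocabulary sizes, and dimensions are assumptions, not the exact architectures from the papers above): the encoder compresses the French sentence into its final hidden state - the context vector - and the decoder generates the English sentence conditioned on that context.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sequence and compresses it into a context vector."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):                     # (batch, T_x)
        _, context = self.rnn(self.embed(src_tokens))
        return context                                 # (1, batch, hidden_dim)

class Decoder(nn.Module):
    """Generates the target sequence conditioned on the context vector."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_tokens, hidden):            # (batch, T_y), context as initial hidden state
        output, hidden = self.rnn(self.embed(prev_tokens), hidden)
        return self.out(output), hidden                # logits over the target vocabulary

# Hypothetical sizes, for illustration only.
encoder = Encoder(vocab_size=30000, embed_dim=256, hidden_dim=512)
decoder = Decoder(vocab_size=20000, embed_dim=256, hidden_dim=512)
src = torch.randint(0, 30000, (8, 12))                 # a batch of 8 French sentences, 12 tokens each
tgt = torch.randint(0, 20000, (8, 10))                 # the shifted English targets (teacher forcing)
logits, _ = decoder(tgt, encoder(src))                 # (8, 10, 20000)
```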

For image captioning, where $X$ is an image and $Y$ is a caption, the model employs a pretrained CNN (like AlexNet) as the encoder and an RNN as the decoder.

Captioning

These models do not sample the output sentence word by word at random; they need to produce the most likely sentence as a whole. The vocabulary is far too large to enumerate every possible sentence, so we use a search algorithm for this. These ideas are from the following papers (they share similar ideas):

  • Mao et al., 2014. Deep captioning with multimodal recurrent neural networks
  • Vinyals et al., 2014. Show and tell: Neural image caption generator
  • Karpathy and Li, 2015. Deep visual-semantic alignments for generating image descriptions

Selecting the Most Likely Sentence

The language model we have been looking at is very similar to the decoder part of the machine translation model, except for $a^{<0>}$. In the language model, we always start with $a^{<0>} = 0$. In the machine translation model, we start with $a^{<0>}$ being a representation of the input sentence.

The problem formulations are also different:

  • In the language model: $P(y^{<1>}, \ldots, y^{<T_y>})$
  • In machine translation: $P(y^{<1>}, \ldots, y^{<T_y>} \mid x^{<1>}, \ldots, x^{<T_x>})$

In machine translation, the goal isn't to randomly sample output sentences, since that would give translations of varying quality. For example, given the French sentence "Jane visite l'Afrique en septembre," the translation could vary:

  • "Jane is visiting Africa in September."
  • "Jane is going to be visiting Africa in September."
  • "In September, Jane will visit Africa."

The objective is to select the best possible translation. A popular algorithm for this is beam search, which outperforms greedy approaches that might otherwise yield suboptimal solutions like "Jane is going to be visiting Africa in September". Essentially, we want to maximize:

$$\arg\max_{\hat{y}^{<1>}, \ldots, \hat{y}^{<T_y>}} P(\hat{y}^{<1>}, \ldots, \hat{y}^{<T_y>} \mid x)$$

where $x$ is the sentence in French. Why not use greedy search and simply take the best choice at each step? It turns out that this approach doesn't really work! Let's explain it with an example.

Say the best output for the example we talked about is "Jane is visiting Africa in September."

Suppose that, with the greedy approach, the first two words chosen are "Jane is". The word most likely to come next is "going", since "going" is a very common word after "is", so the result may look like this: "Jane is going to be visiting Africa in September". And that isn't the best/optimal solution.
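
In code, the greedy strategy looks something like the sketch below. Here `next_word_probs(x, prefix)` is a hypothetical helper (not defined in these notes) that returns the model's distribution over the next word as a word-to-probability dict; the loop simply commits to the single most probable word at every step, which is exactly how it gets locked into "going" after "Jane is".

```python
def greedy_decode(x, next_word_probs, max_len=30, eos="<EOS>"):
    """Myopically pick the single most probable next word at every step."""
    sentence = []
    for _ in range(max_len):
        probs = next_word_probs(x, sentence)   # dict: word -> P(word | x, sentence so far)
        best_word = max(probs, key=probs.get)  # greedy choice
        if best_word == eos:
            break
        sentence.append(best_word)
    return sentence
```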

So, a better approach than the greedy approach is to find an approximate solution that tries to maximize the objective above (the last equation). An approximate search algorithm will try, though it won't always succeed, to pick the sentence $y$ that maximizes that conditional probability.

Beam Search

Beam search, a heuristic search algorithm, is widely used for deriving the best output sequences. It involves selecting a set number of most likely word sequences at each step, guided by a parameter known as the beam width ($B$). For instance, with $B = 3$, the algorithm considers the top three candidate words or sequences at each step, ensuring that only the most likely paths are pursued.

For example, in the first step you may get ["in", "jane", "september"] - these are the best candidates for the first word.

Then, for each word in the first output, we consider the possible second words and keep the top $B$ combinations, whereby the best are those that give the highest value when the probabilities are multiplied:

$$P(y^{<1>} \mid x) \ast P(y^{<2>} \mid x, y^{<1>})$$

As we care about the pair that is the most likely, we multiply the probabilities. We may then have ["in september", "jane is", "jane visit"]. Notice that we automatically discard "september" as a first word, as we keep only $B$ candidates.

We repeat the same process to get the best $B$ next words after ["september", "is", "visit"], and so on. Throughout, we only keep $B$ instances of the network. Note also that if $B = 1$, this becomes the same as greedy search.
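
Below is a minimal beam search sketch in plain Python, using the same hypothetical `next_word_probs(x, prefix)` helper as in the greedy sketch above. At each step it extends every surviving candidate by one word, scores each extension by the product of conditional probabilities, and keeps only the top $B$; with $B = 1$ it reduces to the greedy decoder.

```python
import heapq

def beam_search(x, next_word_probs, B=3, max_len=30, eos="<EOS>"):
    """Keep the B most probable partial sentences at every step."""
    beams = [([], 1.0)]                                  # (partial sentence, probability so far)
    completed = []
    for _ in range(max_len):
        candidates = []
        for sentence, prob in beams:
            for word, p in next_word_probs(x, sentence).items():
                extended = sentence + [word]
                score = prob * p                         # P(y<1>, ..., y<t> | x)
                if word == eos:
                    completed.append((extended, score))
                else:
                    candidates.append((extended, score))
        beams = heapq.nlargest(B, candidates, key=lambda c: c[1])  # prune to beam width B
        if not beams:
            break
    completed.extend(beams)                              # include candidates that never emitted <EOS>
    return max(completed, key=lambda c: c[1])[0]
```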

Refinements to Beam Search

The first refinement is length normalization. To counter the tendency to produce overly short translations, we sum the logarithms of the probabilities (for numerical stability) and adjust for sequence length, controlled by a hyperparameter $\alpha$, typically set around $0.7$.

To illustrate, in beam search, we are trying to maximize:

$$\arg\max_{\hat{y}^{<1>}, \ldots, \hat{y}^{<T_y>}} P(\hat{y}^{<1>}, \ldots, \hat{y}^{<T_y>} \mid x)$$

To do this, we multiply:

$$P(y^{<1>} \mid x) \ast P(y^{<2>} \mid x, y^{<1>}) \ast \ldots \ast P(y^{<T_y>} \mid x, y^{<1>}, \ldots, y^{<T_y-1>})$$

In machine translation, if we carry out beam search without sentence normalization, the algorithm will tend to output overly short translations. Each probability is a fraction, and most of the time a small fraction. Multiplying many small fractions causes numerical underflow, meaning the result is too small for the floating-point representation in your computer to store accurately. So, in practice, we tend to sum the logs of the probabilities instead of multiplying them directly:

$$\arg\max_y \sum_{t=1}^{T_y} \log P(y^{<t>} \mid x, y^{<1>}, \ldots, y^{<t-1>})$$

By taking $\log$'s, you end up with a more numerically stable algorithm that is less prone to rounding errors (numerical underflow). The $\log$ function is strictly monotonically increasing, so maximizing the sum of logs is equivalent to maximizing the original product of probabilities.

But there's another problem. Both objectives above prefer short sequences over long ones, because multiplying more fractions (or summing more negative log terms) gives a smaller value. So there's another step - dividing by the number of elements in the sequence:

$$\arg\max_y \frac{1}{T_y^\alpha} \sum_{t=1}^{T_y} \log P(y^{<t>} \mid x, y^{<1>}, \ldots, y^{<t-1>})$$

We can normalize this by the number of words in the translation. This significantly reduces the penalty for outputting longer translations. Here:

  • $T_y$ is the length of the translation
  • $\alpha$ is a hyperparameter to tune, typically set to $0.7$
  • If $\alpha = 0$, there is no sequence length normalization
  • If $\alpha = 1$, there is full sequence length normalization.
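
As a small sketch of this normalized objective, assuming `log_probs` holds the per-step values $\log P(y^{<t>} \mid x, y^{<1>}, \ldots, y^{<t-1>})$ for one candidate translation:

```python
import math

def normalized_score(log_probs, alpha=0.7):
    """Length-normalized log-likelihood of one candidate translation."""
    T_y = len(log_probs)
    return sum(log_probs) / (T_y ** alpha)   # alpha = 0: no normalization, alpha = 1: full normalization

# A longer candidate is no longer penalized purely for having more terms in the sum.
short = [math.log(0.30), math.log(0.25)]
longer = [math.log(0.30), math.log(0.25), math.log(0.20), math.log(0.20)]
print(normalized_score(short), normalized_score(longer))
```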

The second refinement is choosing the beam width ($B$). A larger $B$ increases the number of possibilities explored, potentially improving results at the cost of higher computational expense. Commonly, $B$ is set to 10 in production settings, though larger values may be used in research.

Unlike exact search algorithms like BFS (Breadth First Search) or DFS (Depth First Search), Beam Search runs faster but is not guaranteed to find the exact solution.

As you run beam search, you will see $B$ candidate sentences of different lengths (say, up to 30 words). You then score them with the objective function above and pick the one that achieves the highest value on this normalized log-probability objective, i.e., the normalized log-likelihood objective.

Error Analysis in Beam Search

The concepts of error analysis can also help us distinguish whether inaccuracies stem from the beam search or from the RNN model. By comparing the probabilities of the ideal translation ($y^\ast$) and the model-generated translation ($\hat{y}$), one can identify which component is underperforming.

Suppose that we have the following:

  • Initial info:
    • $x$ = "Jane visite l'Afrique en septembre."
    • $y^\ast$ = "Jane visits Africa in September." (the ideal answer, generated by a human)
  • $\hat{y}$ = "Jane visited Africa last September."
    • The answer produced by our model - in this case, a bad translation.

We now want to know who to blame - the RNN or the beam search. To do that, we calculate $P(y^\ast \mid x)$ and $P(\hat{y} \mid x)$. There are two cases:

  • Case 1: $P(y^\ast \mid x) > P(\hat{y} \mid x)$
    • Conclusion: Beam search is at fault.
    • Beam search chose $\hat{y}$ even though $y^\ast$ has a higher probability.
  • Case 2: $P(y^\ast \mid x) \leq P(\hat{y} \mid x)$
    • Conclusion: The RNN model is at fault.
    • $y^\ast$ is the better translation, but the RNN assigns $\hat{y}$ a probability at least as high as $y^\ast$'s.

Do this for $N$ dev set examples and count how many times beam search is at fault and how many times the RNN model is at fault. This will help you decide what to work on next, since you can figure out what fraction of the errors is ascribed to the search algorithm rather than to the RNN model.
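
The bookkeeping is straightforward; here is a sketch, where `model_log_prob(x, y)` is a hypothetical function returning the RNN's $\log P(y \mid x)$ and `examples` is a list of `(x, y_star, y_hat)` triples from the dev set:

```python
def attribute_errors(examples, model_log_prob):
    """Count how often beam search vs. the RNN model is at fault."""
    beam_at_fault = rnn_at_fault = 0
    for x, y_star, y_hat in examples:
        if model_log_prob(x, y_star) > model_log_prob(x, y_hat):
            beam_at_fault += 1   # the model prefers y*, yet beam search returned y-hat
        else:
            rnn_at_fault += 1    # the model itself ranks y-hat at least as high as y*
    return beam_at_fault, rnn_at_fault
```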

BLEU Score (Bilingual Evaluation Understudy)

The BLEU (Bilingual Evaluation Understudy) score is a metric used to evaluate machine translations. Given a sentence in one language, there may be one or more good translations in another language. BLEU compares the machine-generated translation to one or more human reference translations, using a modified precision measure that accounts for the frequency of words and $n$-grams in the reference texts.

The BLEU score incorporates a brevity penalty ($BP$) to address the tendency to favor shorter outputs. For a comprehensive evaluation, BLEU combines scores across different $n$-grams (unigrams, bigrams, etc.) and averages them.

The intuition is: as long as the machine-generated translation is pretty close to any of the references provided by humans, then it will get a high BLEU score.

Let's take an example:

  • $X$ = "Le chat est sur le tapis."
  • $Y_1$ = "The cat is on the mat." (Human reference 1)
  • $Y_2$ = "There is a cat on the mat." (Human reference 2)

Suppose that the machine outputs: "the the the the the the the." One way to evaluate the machine output is to look at each word in the output and check whether it appears in the references. That gives a precision of $p = \frac{7}{7}$, because every "the" appears in $Y_1$ or $Y_2$. This is not a useful measure! Basically, "what fraction of words in the output also appear in the references" is not a good scoring measure.

Instead, what we're going to use is a modified precision measure in which we give each word credit only up to the maximum number of times it appears in any single reference sentence. We look for the reference with the maximum count of a particular word and clip that word's count at this number. So the modified precision is $p = \frac{2}{7}$, because the maximum count of "the" is $2$ (in $Y_1$): we clipped the count of $7$ down to the maximum of $2$. Here we are looking at one word at a time - unigrams - but we may look at $n$-grams too.
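
A quick sketch of this clipping with `collections.Counter`, reproducing the $\frac{2}{7}$ above:

```python
from collections import Counter

output = "the the the the the the the".split()
ref1 = "the cat is on the mat".split()
ref2 = "there is a cat on the mat".split()

counts = Counter(output)
clipped = {w: min(c, max(Counter(ref1)[w], Counter(ref2)[w])) for w, c in counts.items()}
print(sum(clipped.values()) / sum(counts.values()))   # 2/7: "the" is clipped to its max reference count, 2 in ref1
```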

BLEU Score on Bigrams

The $n$-grams are typically collected from a text or speech corpus. When the items are words, $n$-grams may also be called shingles. An $n$-gram of size 1 is referred to as a unigram; size 2 is a bigram (or, less commonly, a digram); size 3 is a trigram, etc.

  • $X$ = "Le chat est sur le tapis."
  • $Y_1$ = "The cat is on the mat." (Human reference 1)
  • $Y_2$ = "There is a cat on the mat." (Human reference 2)

Suppose that the machine outputs: "the cat the cat on the mat." The bigrams in the machine output are:

| Pairs | Count | Count clip |
| --- | --- | --- |
| the cat | 2 | 1 ($Y_1$) |
| cat the | 1 | 0 |
| cat on | 1 | 1 ($Y_2$) |
| on the | 1 | 1 ($Y_1$) |
| the mat | 1 | 1 ($Y_1$) |
| Totals | 6 | 4 |

The modified bigram precision is therefore $p_2 = \frac{4}{6}$.

The modified precision for unigrams:

$$p_1 = \frac{\sum_{\text{unigram} \in \hat{y}} \text{count}_{\text{clip}}(\text{unigram})}{\sum_{\text{unigram} \in \hat{y}} \text{count}(\text{unigram})}$$

And for $n$-grams:

$$p_n = \frac{\sum_{\text{n-gram} \in \hat{y}} \text{count}_{\text{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in \hat{y}} \text{count}(\text{n-gram})}$$

Let's put this together to formalize the BLEU score:

$$BLEU = BP \cdot \exp\left(\frac{1}{N} \sum_{n=1}^{N} \log p_n\right)$$

For example, for BLEU on up to 4-grams, we compute $p_1$, $p_2$, $p_3$, $p_4$, average their logs, and take the $\exp$.

Here $BP$ is the brevity penalty and $N$ is the maximum $n$-gram size (4 in this example). The brevity penalty is a penalty term that penalizes the machine translation model for outputting a translation that's too short. It's a number between $0$ and $1$.

$$BP = \begin{cases} 1 & \text{if } \hat{y}_{\text{output length}} > y_{\text{reference length}} \\ e^{\left(1 - \frac{y_{\text{reference length}}}{\hat{y}_{\text{output length}}}\right)} & \text{if } \hat{y}_{\text{output length}} \leq y_{\text{reference length}} \end{cases}$$
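
Putting the pieces together, here is a sketch of the whole computation for a single candidate and a single reference. Real implementations handle multiple references and corpus-level counts, and they usually smooth the zero-precision case instead of returning 0, so treat this only as an illustration of the formulas above.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def bleu(candidate, reference, N=4):
    precisions = [modified_precision(candidate, reference, n) for n in range(1, N + 1)]
    if min(precisions) == 0:
        return 0.0                                # any zero p_n makes the geometric mean zero
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)    # brevity penalty for short candidates
    return bp * math.exp(sum(math.log(p) for p in precisions) / N)

print(bleu("the cat is on the mat".split(), "the cat is on the mat".split()))  # 1.0
```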

The BLEU score has several open-source implementations. It is used in a variety of systems like machine translation and image captioning.

Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: a method for automatic evaluation of machine translation.