Sequence to Sequence Models

Sequence to sequence models, often referred to as Many to Many models, are crucial in driving progress in areas like machine translation, speech recognition, and image captioning.

Basic Model Architecture

Consider a machine translation problem where $X$ represents a sequence in French and $Y$ denotes its English translation.

Translation

The foundational model architecture consists of two key components:

  • Encoder: This could be an RNN, LSTM, or GRU, designed to process the input sequence (e.g., a French sentence) and compress its information into a context vector.

  • Decoder: Also an RNN, LSTM, or GRU, which takes the context vector and generates the corresponding output sequence (e.g., the translated English sentence).

These ideas are from the following papers:

  • Sutskever et al., 2014. Sequence to sequence learning with neural networks
  • Cho et al., 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation
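
To make the encoder-decoder picture concrete, here is a minimal sketch in PyTorch. It is only an illustration of the idea (the GRU cells, vocabulary sizes, and dimensions are assumptions, not the exact architectures from the papers above): the encoder compresses the French sentence into its final hidden state - the context vector - and the decoder generates the English sentence conditioned on that context.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sequence and compresses it into a context vector."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):                     # (batch, T_x)
        _, context = self.rnn(self.embed(src_tokens))
        return context                                 # (1, batch, hidden_dim)

class Decoder(nn.Module):
    """Generates the target sequence conditioned on the context vector."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_tokens, hidden):            # (batch, T_y), context as initial hidden state
        output, hidden = self.rnn(self.embed(prev_tokens), hidden)
        return self.out(output), hidden                # logits over the target vocabulary

# Hypothetical sizes, for illustration only.
encoder = Encoder(vocab_size=30000, embed_dim=256, hidden_dim=512)
decoder = Decoder(vocab_size=20000, embed_dim=256, hidden_dim=512)
src = torch.randint(0, 30000, (8, 12))                 # a batch of 8 French sentences, 12 tokens each
tgt = torch.randint(0, 20000, (8, 10))                 # the shifted English targets (teacher forcing)
logits, _ = decoder(tgt, encoder(src))                 # (8, 10, 20000)
```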

For image captioning, where $X$ is an image and $Y$ is a caption, the model employs a pretrained CNN (like AlexNet) as the encoder and an RNN as the decoder.

Captioning

These models do not sample the output sentence word by word at random; they need to produce the most likely sentence as a whole. The vocabulary is far too large to enumerate every possible sentence, so we use a search algorithm for this. These ideas are from the following papers (they share similar ideas):

  • Mao et al., 2014. Deep captioning with multimodal recurrent neural networks
  • Vinyals et al., 2014. Show and tell: Neural image caption generator
  • Karpathy and Li, 2015. Deep visual-semantic alignments for generating image descriptions

Selecting the Most Likely Sentence

The language model we have been looking at is very similar to the decoder part of the machine translation model, except for $a^{<0>}$. In the language model, we always start with $a^{<0>} = 0$. In the machine translation model, we start with $a^{<0>}$ being a representation of the input sentence.

The problem formulations are also different:

  • In the language model: $P(y^{<1>}, \ldots, y^{<T_y>})$
  • In machine translation: $P(y^{<1>}, \ldots, y^{<T_y>} \mid x^{<1>}, \ldots, x^{<T_x>})$

In machine translation, the goal isn't to randomly sample output sentences, since that would give translations of varying quality. For example, given the French sentence "Jane visite l'Afrique en septembre," the translation could vary:

  • "Jane is visiting Africa in September."
  • "Jane is going to be visiting Africa in September."
  • "In September, Jane will visit Africa."

The objective is to select the best possible translation. A popular algorithm for this is beam search, which outperforms greedy approaches that might otherwise yield suboptimal solutions like "Jane is going to be visiting Africa in September". Essentially, we want to maximize:

$$\arg\max_{\hat{y}^{<1>}, \ldots, \hat{y}^{<T_y>}} P(\hat{y}^{<1>}, \ldots, \hat{y}^{<T_y>} \mid x)$$

where $x$ is the sentence in French. Why not use greedy search and simply take the best choice at each step? It turns out that this approach doesn't really work! Let's explain it with an example.

Say the best output for the example we talked about is "Jane is visiting Africa in September."

Suppose that, with the greedy approach, the first two words chosen are "Jane is". The word most likely to come next is "going", since "going" is a very common word after "is", so the result may look like this: "Jane is going to be visiting Africa in September". And that isn't the best/optimal solution.
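
In code, the greedy strategy looks something like the sketch below. Here `next_word_probs(x, prefix)` is a hypothetical helper (not defined in these notes) that returns the model's distribution over the next word as a word-to-probability dict; the loop simply commits to the single most probable word at every step, which is exactly how it gets locked into "going" after "Jane is".

```python
def greedy_decode(x, next_word_probs, max_len=30, eos="<EOS>"):
    """Myopically pick the single most probable next word at every step."""
    sentence = []
    for _ in range(max_len):
        probs = next_word_probs(x, sentence)   # dict: word -> P(word | x, sentence so far)
        best_word = max(probs, key=probs.get)  # greedy choice
        if best_word == eos:
            break
        sentence.append(best_word)
    return sentence
```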

So, a better approach than the greedy approach is to find an approximate solution that tries to maximize the objective above (the last equation). An approximate search algorithm will try, though it won't always succeed, to pick the sentence $y$ that maximizes that conditional probability.

Beam Search

Beam search, a heuristic search algorithm, is widely used for deriving the best output sequences. It involves selecting a set number of most likely word sequences at each step, guided by a parameter known as the beam width ($B$). For instance, with $B = 3$, the algorithm considers the top three candidate words or sequences at each step, ensuring that only the most likely paths are pursued.

For example, in the first step you may get ["in", "jane", "september"] - these are the best candidates for the first word.

Then, for each word in the first output, we consider the possible second words and keep the top $B$ combinations, whereby the best are those that give the highest value when the probabilities are multiplied:

$$P(y^{<1>} \mid x) \ast P(y^{<2>} \mid x, y^{<1>})$$

As we care about the pair that is the most likely, we multiply the probabilities. We may then have ["in september", "jane is", "jane visit"]. Notice that we automatically discard "september" as a first word, as we keep only $B$ candidates.

We repeat the same process to get the best $B$ next words after ["september", "is", "visit"], and so on. Throughout, we only keep $B$ instances of the network. Note also that if $B = 1$, this becomes the same as greedy search.
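
Below is a minimal beam search sketch in plain Python, using the same hypothetical `next_word_probs(x, prefix)` helper as in the greedy sketch above. At each step it extends every surviving candidate by one word, scores each extension by the product of conditional probabilities, and keeps only the top $B$; with $B = 1$ it reduces to the greedy decoder.

```python
import heapq

def beam_search(x, next_word_probs, B=3, max_len=30, eos="<EOS>"):
    """Keep the B most probable partial sentences at every step."""
    beams = [([], 1.0)]                                  # (partial sentence, probability so far)
    completed = []
    for _ in range(max_len):
        candidates = []
        for sentence, prob in beams:
            for word, p in next_word_probs(x, sentence).items():
                extended = sentence + [word]
                score = prob * p                         # P(y<1>, ..., y<t> | x)
                if word == eos:
                    completed.append((extended, score))
                else:
                    candidates.append((extended, score))
        beams = heapq.nlargest(B, candidates, key=lambda c: c[1])  # prune to beam width B
        if not beams:
            break
    completed.extend(beams)                              # include candidates that never emitted <EOS>
    return max(completed, key=lambda c: c[1])[0]
```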

Refinements to Beam Search

The first refinement is length normalization. To counter the tendency to produce overly short translations, we sum the logarithms of the probabilities (for numerical stability) and adjust for sequence length, controlled by a hyperparameter $\alpha$, typically set around $0.7$.

To illustrate, in beam search, we are trying to maximize:

$$\arg\max_{\hat{y}^{<1>}, \ldots, \hat{y}^{<T_y>}} P(\hat{y}^{<1>}, \ldots, \hat{y}^{<T_y>} \mid x)$$

To do this, we multiply:

$$P(y^{<1>} \mid x) \ast P(y^{<2>} \mid x, y^{<1>}) \ast \ldots \ast P(y^{<T_y>} \mid x, y^{<1>}, \ldots, y^{<T_y-1>})$$

In machine translation, if we carry out beam search without sentence normalization, the algorithm will tend to output overly short translations. Each probability is a fraction, and most of the time a small fraction. Multiplying many small fractions causes numerical underflow, meaning the result is too small for the floating-point representation in your computer to store accurately. So, in practice, we tend to sum the logs of the probabilities instead of multiplying them directly:

$$\arg\max_y \sum_{t=1}^{T_y} \log P(y^{<t>} \mid x, y^{<1>}, \ldots, y^{<t-1>})$$

By taking $\log$'s, you end up with a more numerically stable algorithm that is less prone to rounding errors (numerical underflow). The $\log$ function is strictly monotonically increasing, so maximizing the sum of logs is equivalent to maximizing the original product of probabilities.

But there's another problem. Both objectives above prefer short sequences over long ones, because multiplying more fractions (or summing more negative log terms) gives a smaller value. So there's another step - dividing by the number of elements in the sequence:

$$\arg\max_y \frac{1}{T_y^\alpha} \sum_{t=1}^{T_y} \log P(y^{<t>} \mid x, y^{<1>}, \ldots, y^{<t-1>})$$

We can normalize this by the number of words in the translation. This significantly reduces the penalty for outputting longer translations. Here:

  • $T_y$ is the length of the translation
  • $\alpha$ is a hyperparameter to tune, typically set to $0.7$
  • If $\alpha = 0$, there is no sequence length normalization
  • If $\alpha = 1$, there is full sequence length normalization.
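
As a small sketch of this normalized objective, assuming `log_probs` holds the per-step values $\log P(y^{<t>} \mid x, y^{<1>}, \ldots, y^{<t-1>})$ for one candidate translation:

```python
import math

def normalized_score(log_probs, alpha=0.7):
    """Length-normalized log-likelihood of one candidate translation."""
    T_y = len(log_probs)
    return sum(log_probs) / (T_y ** alpha)   # alpha = 0: no normalization, alpha = 1: full normalization

# A longer candidate is no longer penalized purely for having more terms in the sum.
short = [math.log(0.30), math.log(0.25)]
longer = [math.log(0.30), math.log(0.25), math.log(0.20), math.log(0.20)]
print(normalized_score(short), normalized_score(longer))
```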

The second refinement is choosing the beam width ($B$). A larger $B$ increases the number of possibilities explored, potentially improving results at the cost of higher computational expense. Commonly, $B$ is set to 10 in production settings, though larger values may be used in research.

Unlike exact search algorithms like BFS (Breadth First Search) or DFS (Depth First Search), Beam Search runs faster but is not guaranteed to find the exact solution.

As you run beam search, you will see $B$ candidate sentences of different lengths (say, up to 30 words). You then score them with the objective function above and pick the one that achieves the highest value on this normalized log-probability objective, i.e., the normalized log-likelihood objective.

Error Analysis in Beam Search

The concepts of error analysis can also help us distinguish whether inaccuracies stem from the beam search or from the RNN model. By comparing the probabilities of the ideal translation ($y^\ast$) and the model-generated translation ($\hat{y}$), one can identify which component is underperforming.

Suppose that we have the following:

  • Initial info:
    • $x$ = "Jane visite l'Afrique en septembre."
    • $y^\ast$ = "Jane visits Africa in September." (the ideal answer, generated by a human)
  • $\hat{y}$ = "Jane visited Africa last September."
    • The answer produced by our model - in this case, a bad translation.

We now want to know who to blame - the RNN or the beam search. To do that, we calculate $P(y^\ast \mid x)$ and $P(\hat{y} \mid x)$. There are two cases:

  • Case 1: $P(y^\ast \mid x) > P(\hat{y} \mid x)$
    • Conclusion: Beam search is at fault.
    • Beam search chose $\hat{y}$ even though $y^\ast$ has a higher probability.
  • Case 2: $P(y^\ast \mid x) \leq P(\hat{y} \mid x)$
    • Conclusion: The RNN model is at fault.
    • $y^\ast$ is the better translation, but the RNN assigns $\hat{y}$ a probability at least as high as $y^\ast$'s.

Do this for $N$ dev set examples and count how many times beam search is at fault and how many times the RNN model is at fault. This will help you decide what to work on next, since you can figure out what fraction of the errors is ascribed to the search algorithm rather than to the RNN model.
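
The bookkeeping is straightforward; here is a sketch, where `model_log_prob(x, y)` is a hypothetical function returning the RNN's $\log P(y \mid x)$ and `examples` is a list of `(x, y_star, y_hat)` triples from the dev set:

```python
def attribute_errors(examples, model_log_prob):
    """Count how often beam search vs. the RNN model is at fault."""
    beam_at_fault = rnn_at_fault = 0
    for x, y_star, y_hat in examples:
        if model_log_prob(x, y_star) > model_log_prob(x, y_hat):
            beam_at_fault += 1   # the model prefers y*, yet beam search returned y-hat
        else:
            rnn_at_fault += 1    # the model itself ranks y-hat at least as high as y*
    return beam_at_fault, rnn_at_fault
```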

BLEU Score (Bilingual Evaluation Understudy)

The BLEU (Bilingual Evaluation Understudy) score is a metric used to evaluate machine translations. Given a sentence in one language, there may be one or more good translations in another language. BLEU compares the machine-generated translation to one or more human reference translations, using a modified precision measure that accounts for the frequency of words and $n$-grams in the reference texts.

The BLEU score incorporates a brevity penalty ($BP$) to address the tendency to favor shorter outputs. For a comprehensive evaluation, BLEU combines scores across different $n$-grams (unigrams, bigrams, etc.) and averages them.

The intuition is: as long as the machine-generated translation is pretty close to any of the references provided by humans, then it will get a high BLEU score.

Let's take an example:

  • $X$ = "Le chat est sur le tapis."
  • $Y_1$ = "The cat is on the mat." (Human reference 1)
  • $Y_2$ = "There is a cat on the mat." (Human reference 2)

Suppose that the machine outputs: "the the the the the the the." One way to evaluate the machine output is to look at each word in the output and check whether it appears in the references. That gives a precision of $p = \frac{7}{7}$, because every "the" appears in $Y_1$ or $Y_2$. This is not a useful measure! Basically, "what fraction of words in the output also appear in the references" is not a good scoring measure.

Instead, what we're going to use is a modified precision measure in which we give each word credit only up to the maximum number of times it appears in any single reference sentence. We look for the reference with the maximum count of a particular word and clip that word's count at this number. So the modified precision is $p = \frac{2}{7}$, because the maximum count of "the" is $2$ (in $Y_1$): we clipped the count of $7$ down to the maximum of $2$. Here we are looking at one word at a time - unigrams - but we may look at $n$-grams too.
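
A quick sketch of this clipping with `collections.Counter`, reproducing the $\frac{2}{7}$ above:

```python
from collections import Counter

output = "the the the the the the the".split()
ref1 = "the cat is on the mat".split()
ref2 = "there is a cat on the mat".split()

counts = Counter(output)
clipped = {w: min(c, max(Counter(ref1)[w], Counter(ref2)[w])) for w, c in counts.items()}
print(sum(clipped.values()) / sum(counts.values()))   # 2/7: "the" is clipped to its max reference count, 2 in ref1
```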

BLEU Score on Bigrams

The $n$-grams are typically collected from a text or speech corpus. When the items are words, $n$-grams may also be called shingles. An $n$-gram of size 1 is referred to as a unigram; size 2 is a bigram (or, less commonly, a digram); size 3 is a trigram, etc.

  • $X$ = "Le chat est sur le tapis."
  • $Y_1$ = "The cat is on the mat." (Human reference 1)
  • $Y_2$ = "There is a cat on the mat." (Human reference 2)

Suppose that the machine outputs: "the cat the cat on the mat." The bigrams in the machine output are:

| Pairs | Count | Count clip |
| --- | --- | --- |
| the cat | 2 | 1 ($Y_1$) |
| cat the | 1 | 0 |
| cat on | 1 | 1 ($Y_2$) |
| on the | 1 | 1 ($Y_1$) |
| the mat | 1 | 1 ($Y_1$) |
| Totals | 6 | 4 |

The modified bigram precision is therefore $p_2 = \frac{4}{6}$.

The modified precision for unigrams:

$$p_1 = \frac{\sum_{\text{unigram} \in \hat{y}} \text{count}_{\text{clip}}(\text{unigram})}{\sum_{\text{unigram} \in \hat{y}} \text{count}(\text{unigram})}$$

And for $n$-grams:

$$p_n = \frac{\sum_{\text{n-gram} \in \hat{y}} \text{count}_{\text{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in \hat{y}} \text{count}(\text{n-gram})}$$

Let's put this together to formalize the BLEU score:

$$BLEU = BP \cdot \exp\left(\frac{1}{N} \sum_{n=1}^{N} \log p_n\right)$$

For example, for BLEU on up to 4-grams, we compute $p_1$, $p_2$, $p_3$, $p_4$, average their logs, and take the $\exp$.

Here $BP$ is the brevity penalty and $N$ is the maximum $n$-gram size (4 in this example). The brevity penalty is a penalty term that penalizes the machine translation model for outputting a translation that's too short. It's a number between $0$ and $1$.

$$BP = \begin{cases} 1 & \text{if } \hat{y}_{\text{output length}} > y_{\text{reference length}} \\ e^{\left(1 - \frac{y_{\text{reference length}}}{\hat{y}_{\text{output length}}}\right)} & \text{if } \hat{y}_{\text{output length}} \leq y_{\text{reference length}} \end{cases}$$
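
Putting the pieces together, here is a sketch of the whole computation for a single candidate and a single reference. Real implementations handle multiple references and corpus-level counts, and they usually smooth the zero-precision case instead of returning 0, so treat this only as an illustration of the formulas above.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def bleu(candidate, reference, N=4):
    precisions = [modified_precision(candidate, reference, n) for n in range(1, N + 1)]
    if min(precisions) == 0:
        return 0.0                                # any zero p_n makes the geometric mean zero
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)    # brevity penalty for short candidates
    return bp * math.exp(sum(math.log(p) for p in precisions) / N)

print(bleu("the cat is on the mat".split(), "the cat is on the mat".split()))  # 1.0
```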

The BLEU score has several open-source implementations. It is used in a variety of systems like machine translation and image captioning.

Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: a method for automatic evaluation of machine translation.