
Attention Mechanism

Long Short-Term Memory (LSTM) models have substantially improved the capabilities of Recurrent Neural Networks (RNNs). The attention mechanism, however, represents a further significant advancement. The core idea behind attention is to let the model, at each step, selectively focus on a subset of relevant information from a larger pool of data.

Take this model, with its inputs and outputs, as an example:

The encoder here has to compress the entire long sequence into a single vector, and the decoder then has to unpack that vector to generate the translation. A human translator wouldn't read the whole sentence, memorize it, and only then translate it; they would translate a part at a time. We can also see that the performance of this model decreases as sentence length increases. In the plot below, the x-axis is sentence length, the blue curve is the plain encoder-decoder model, and the green curve is the model with the attention mechanism:

For another example, when an RNN generates a description for an image, it might selectively concentrate on different segments of the image for each word it produces. This technique has been successfully implemented by Xu et al. (2015), and it provides an excellent entry point for those interested in exploring attention mechanisms. The attention mechanism has yielded exciting results across various applications, indicating that its potential is far from being fully tapped.

Sequence models benefit from the integration of attention mechanisms, which guide the model's focus through a sequence of inputs. This approach is particularly effective in tasks such as speech recognition, where it helps the model process audio data more efficiently. The attention model was originally developed for machine translation, but it has since spread to other applications such as computer vision, and new architectures like the Neural Turing Machine build on attention mechanisms as well.

The attention model was described in these papers:

Bahdanau et al., 2014. Neural Machine Translation by Jointly Learning to Align and Translate

Xu et al., 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Understanding Attention Models

Sequence-to-sequence models built from an encoder and a decoder were the standard approach until the introduction of attention mechanisms, which significantly enhanced their performance. Suppose that our encoder is a bidirectional RNN:

Bidirectional RNN

The encoder receives a sentence in French and generates a representative vector. To produce the first word in English, "Jane," the decoder, also an RNN, needs to determine which parts of the French sentence are relevant. The model computes attention weights to focus selectively on the important words, in this case "Jane", "visite" and "l'Afrique". These weights dictate the model's focus at each step of the translation process, as shown below:

The model continues until it reaches an end-of-sequence marker <EOS>. For notation, $\alpha^{<3, t'>}$ is the amount of attention the decoder RNN at output step 3 should pay to the French word at input time step $t'$. Here, the attention weights $\alpha^{<1,1>}, \alpha^{<1,2>}, \alpha^{<1,3>}$ illustrate the model's focus points. For each English word generated, a set of attention weights controls the focus on the corresponding French words:

So, to generate each output word, there is a set of attention weights that controls which input words the model is looking at right now.

Formalizing the Attention Model

Building on our intuition, we can detail the implementation of the attention model. It begins with a bidirectional RNN, often an LSTM, to encode the input sequence:

Input LSTM

The feature vector at input time step $t'$, $\mathbf{a}^{<t'>}$, comprises both the forward and backward activations, concatenated together:

$$\mathbf{a}^{<t'>} = \left[ \mathbf{\overrightarrow{a}}^{<t'>}, \mathbf{\overleftarrow{a}}^{<t'>} \right]$$
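As a concrete sketch of this encoder, a bidirectional LSTM in Keras produces exactly this concatenation. The sizes below ($T_x$, the feature dimension, and the hidden size) are illustrative, not taken from the original model:

```python
from tensorflow.keras.layers import Input, LSTM, Bidirectional
from tensorflow.keras.models import Model

Tx, n_features, n_a = 30, 64, 32          # illustrative sizes
x = Input(shape=(Tx, n_features))
# The Bidirectional wrapper concatenates forward and backward activations,
# so each a^{<t'>} has dimension 2 * n_a.
a = Bidirectional(LSTM(n_a, return_sequences=True))(x)
encoder = Model(inputs=x, outputs=a)
print(encoder.output_shape)               # (None, Tx, 2 * n_a)
```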

For simplicity, let's assume that $\mathbf{a}^{<t'>}$ includes the activations from both directions at time step $t'$. Here the prime is used to index the French sentence. A unidirectional RNN then produces the output using a context $c$, which is computed from the attention weights; these weights denote how much the output at each decoder step needs to look at $\mathbf{a}^{<t'>}$.

For each output step, the attention weights over the input sequence should sum to 1:

$$\sum_{t'=1}^{T_x} \alpha^{<t, t'>} = 1$$

The context $c^{<t>}$ for decoder step $t$ is calculated using this equation:

$$c^{<t>} = \sum_{t'=1}^{T_x} \alpha^{<t, t'>} \mathbf{a}^{<t'>}$$
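As a minimal NumPy sketch (with made-up sizes and weights), the context is just a weighted sum of the encoder activations:

```python
import numpy as np

Tx, n_a = 5, 4                                   # illustrative sizes
a = np.random.randn(Tx, 2 * n_a)                 # encoder activations a^{<t'>}
alpha = np.array([0.6, 0.2, 0.1, 0.05, 0.05])    # attention weights for one decoder step
assert np.isclose(alpha.sum(), 1.0)              # the weights sum to 1

# c^{<t>}: attention-weighted sum of the encoder activations, shape (2 * n_a,)
context = (alpha[:, None] * a).sum(axis=0)
```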

Let's see how we can compute the attention weights. Conceptually, $\alpha^{<t,t'>}$ is the amount of attention $y^{<t>}$ should pay to $a^{<t'>}$. So, for example, if $a^{<1>}$ encodes "Jane", then $\alpha^{<3,1>}$ should be large if the output $y^{<3>}$ is predicting the name "Jane". To achieve this, we use a softmax-like calculation that ensures the attention weights sum to 1:

$$\alpha^{<t, t'>} = \frac{\exp(e^{<t, t'>})}{\sum_{t'=1}^{T_x} \exp(e^{<t, t'>})}$$

Here $e^{<t, t'>}$ is called the "alignment" score. We expect $\alpha^{<t, t'>}$ to be larger for the activations $a^{<t'>}$ that are most relevant to the value the network should output for $\hat{y}^{<t>}$.

We will compute the alignment scores $e^{<t, t'>}$ using a small neural network (usually a single layer, because we will need to compute them many times):

Alignment Function

Here, $s^{<t-1>}$ is the hidden state of the decoder RNN from the previous step, and $\mathbf{a}^{<t'>}$ is the activation of the bidirectional encoder RNN. It is natural that $\alpha^{<t, t'>}$ and $e^{<t, t'>}$ should depend on these two quantities, but we don't know what the function is, so we simply train a very small neural network to learn whatever it should be. One disadvantage of this algorithm is that it takes quadratic time, or quadratic cost, to run.
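Here is a minimal NumPy sketch of that small alignment network. The parameter names (W1, b1, W2, b2) and the sizes are hypothetical; the point is only that a one-hidden-layer network maps the pair $(s^{<t-1>}, \mathbf{a}^{<t'>})$ to a scalar score, which is then softmax-normalized across $t'$:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

Tx, n_a, n_s, n_hidden = 5, 4, 8, 10     # illustrative sizes
a = np.random.randn(Tx, 2 * n_a)         # encoder activations a^{<t'>}
s_prev = np.random.randn(n_s)            # decoder hidden state s^{<t-1>}

# Hypothetical parameters of a one-hidden-layer alignment network
W1 = np.random.randn(n_hidden, n_s + 2 * n_a); b1 = np.zeros(n_hidden)
W2 = np.random.randn(1, n_hidden);             b2 = np.zeros(1)

# e^{<t, t'>}: one scalar score per input time step t'
e = np.array([
    (W2 @ np.tanh(W1 @ np.concatenate([s_prev, a[t_prime]]) + b1) + b2).item()
    for t_prime in range(Tx)
])
alpha = softmax(e)                       # attention weights for this decoder step, sum to 1
```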

If you have $T_x$ words in the input and $T_y$ words in the output, then the total number of attention weights to compute is $T_x \cdot T_y$. In machine translation, where neither input nor output sentences are usually very long, this quadratic cost is often acceptable.

One fun way to see how attention works is by visualizing the attention weights:

Colors indicate the magnitude of the attention weights: input and output words that are strongly related receive higher weights. This visualization shows that the model has learned where to pay attention purely through training with the attention mechanism.
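One quick way to produce such a visualization is to plot the matrix of attention weights with matplotlib; the weights and word lists below are made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical attention weights: rows = output (English) words, columns = input (French) words
alphas = np.array([[0.85, 0.10, 0.05],
                   [0.10, 0.80, 0.10],
                   [0.05, 0.15, 0.80]])
french  = ["Jane", "visite", "l'Afrique"]
english = ["Jane", "visits", "Africa"]

fig, ax = plt.subplots()
im = ax.imshow(alphas, cmap="Greys")
ax.set_xticks(range(len(french)));  ax.set_xticklabels(french)
ax.set_yticks(range(len(english))); ax.set_yticklabels(english)
fig.colorbar(im, label="attention weight")
plt.show()
```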

Building a Machine Translation Model with Attention

We can implement a machine translation model with attention using two separate LSTMs:

Attention Model Architecture

This model features two LSTM layers with specific roles:

  1. Pre-Attention Bi-LSTM: This network processes the input sequence in both directions, capturing the nuances of the context before it reaches the attention mechanism. It unfolds over $T_x$ time steps, each step corresponding to a word or token in the input sequence.

  2. Post-Attention LSTM: After the attention mechanism has determined the focus areas within the input sequence, this LSTM network generates the output sequence one time step at a time over $T_y$ steps. At each step $t$, it produces a single word or token in the target language.

The post-attention LSTM is responsible for carrying the hidden state $s^{<t>}$ and the cell state $c^{<t>}$ from one time step to the next. Unlike a simple RNN, the LSTM maintains both the output activation $s^{<t>}$ and an internal cell state $c^{<t>}$ (not to be confused with the attention context), enhancing the model's memory capabilities.

A key design choice in this model is that the post-attention LSTM at time step $t$ does not use the previously generated output $y^{<t-1>}$ as input. Instead, it relies only on the carried state and the attention context. The model is designed this way because, unlike language generation where adjacent characters are highly correlated, there isn't as strong a dependency between the previous character and the next character in a YYYY-MM-DD date (the output format in this example).
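A sketch of a single post-attention decoder step in Keras, under the assumptions above (only the attention context and the carried state are fed in, never $y^{<t-1>}$); the layer sizes and vocabulary size are illustrative:

```python
from tensorflow.keras.layers import Input, LSTM, Dense

n_a, n_s, vocab_size = 32, 64, 11            # illustrative sizes
context = Input(shape=(1, 2 * n_a))          # attention context for this output step
s_prev  = Input(shape=(n_s,))                # carried hidden state s^{<t-1>}
c_prev  = Input(shape=(n_s,))                # carried cell state c^{<t-1>}

post_attention_lstm = LSTM(n_s, return_state=True)
output_layer = Dense(vocab_size, activation="softmax")

# The step receives only the context and the carried state -- not y^{<t-1>}.
s, _, c = post_attention_lstm(context, initial_state=[s_prev, c_prev])
y_t = output_layer(s)                        # prediction for this output time step
```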

Understanding the Attention Mechanism

Each "Attention" step within the model computes attention variables α<t,t>\alpha^{<t, t'>}, which contribute to determining the context variable c<t>c^{<t>} for every output time step (ranging from t=1t=1 to TyT_y).

Attention Mechanism Step

The above is built out in Keras using:

  • A RepeatVector node that duplicates the hidden state $s^{<t-1>}$ $T_x$ times.
  • A Concatenate operation that merges the replicated hidden state $s^{<t-1>}$ with the encoded input $a^{<t'>}$; a small dense network then maps this to the relevance score $e^{<t, t'>}$.
  • A softmax layer that normalizes these relevance scores to produce the attention weights $\alpha^{<t, t'>}$, ensuring they sum to 1 and indicating the degree of focus the model should allocate to each input time step when generating each word in the output sequence.
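Put together, a minimal sketch of this attention step in Keras could look like the following. The layer sizes are illustrative, and the two small Dense layers stand in for the alignment network described earlier:

```python
from tensorflow.keras.layers import Dense, RepeatVector, Concatenate, Softmax, Dot

Tx, n_a, n_s = 30, 32, 64                    # illustrative sizes

# Layers are created once and shared across all Ty output time steps
repeator     = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1      = Dense(10, activation="tanh")  # small alignment network
densor2      = Dense(1, activation="relu")
activator    = Softmax(axis=1)               # normalize over the Tx axis, not the feature axis
dotor        = Dot(axes=1)                   # weighted sum over the Tx axis

def one_step_attention(a, s_prev):
    """a: (batch, Tx, 2*n_a) encoder activations; s_prev: (batch, n_s) decoder state."""
    s_prev  = repeator(s_prev)               # (batch, Tx, n_s)
    concat  = concatenator([a, s_prev])      # (batch, Tx, 2*n_a + n_s)
    e       = densor2(densor1(concat))       # (batch, Tx, 1) relevance scores
    alphas  = activator(e)                   # (batch, Tx, 1) attention weights, sum to 1
    context = dotor([alphas, a])             # (batch, 1, 2*n_a) context vector
    return context
```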

The attention architecture thus allows the model to dynamically focus on different parts of the input sequence at each step of the output generation, closely mimicking the way humans translate languages by considering relevant parts of the source material as needed.

Speech Recognition & Attention

Speech recognition has made leaps forward with sequence-to-sequence models, which now provide highly accurate transcription. Defining the speech recognition problem:

  • $X$: the audio clip
  • $Y$: the corresponding transcript

When visualized, an audio clip appears as a waveform, with time on the horizontal axis and air pressure variations - interpreted as sound by our ears - on the vertical axis.

Waveform Image

You can think of an audio recording as a long list of numbers measuring the little air pressure changes detected by the microphone.

A standard audio recording is captured at a sampling rate of 44,100 Hz, meaning 44,100 data points per second represent the minute pressure changes. Consequently, a 10-second clip is represented by 441,000 data points.

Directly handling this raw audio form is complex, which is why we often transform it into a spectrogram, a representation more akin to the human auditory system.

Spectrogram Illustration

A spectrogram plots time against frequency, with color intensity indicating energy levels - essentially, the loudness at various frequencies. It's produced by segmenting the audio with a sliding window and applying a Fourier transform to capture frequency activity within each segment.
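For example, a spectrogram can be computed with SciPy roughly as follows; the synthetic sine wave simply stands in for real audio:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 44100                                   # sampling rate in Hz
t = np.arange(0, 10, 1 / fs)                 # 10 seconds -> 441,000 samples
x = np.sin(2 * np.pi * 440 * t)              # synthetic 440 Hz tone standing in for speech

# Sliding window + Fourier transform per segment, as described above
freqs, times, Sxx = spectrogram(x, fs=fs, nperseg=1024, noverlap=512)
print(Sxx.shape)                             # (frequency bins, time frames)
```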

Historically, speech recognition systems relied on phonemes (distinct units of sound, e.g. "kuh", "lee") assembled by linguists. However, the advent of end-to-end deep learning has diminished the need for such manual interventions, thanks to vast audio datasets. Contemporary systems, some trained on upwards of 100,000 hours of audio, leverage models like the attention model shown below:

Attention Model in Speech Recognition

A notable technique for these models is the CTC cost ("Connectionist Temporal Classification"), which facilitates the alignment of input sequences with their transcriptions even when the input is substantially longer than the output. To explain this, let's say that $Y$ = "the quick brown fox". We are going to use an RNN with the following input/output structure:

RNN with CTC Cost Illustration

This is drawn as a unidirectional RNN, but in practice a bidirectional RNN is used. Notice that the number of inputs and the number of outputs are the same here, whereas in the speech recognition problem the input $X$ tends to be much longer than the output $Y$. For example, 10 seconds of audio featurized at 100 Hz gives an input $X$ with 1,000 time steps, yet those 10 seconds don't contain anywhere near 1,000 characters of transcript.

CTC employs a special "blank" character to handle alignment, allowing for sequences of varying lengths to be matched with their transcriptions.

ttt_h_eee<SPC>____<SPC>qqq___uu

The _ is a special character called the "blank", and <SPC> stands for the "space" character. The basic rule for CTC is to collapse repeated characters that are not separated by a blank. This way, the 19-character transcript in our $Y$ can be represented as a 1,000-character output sequence using CTC and its special blanks. The ideas were taken from this paper:
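A toy implementation of this collapse rule (only the decoding side of CTC, not the cost computation itself) might look like this:

```python
def ctc_collapse(chars, blank="_"):
    """Collapse repeated characters not separated by a blank, then drop the blanks."""
    out, prev = [], None
    for ch in chars:
        if ch != blank and ch != prev:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_collapse("ttt_h_eee"))   # -> "the"
print(ctc_collapse("qqq___uu"))    # -> "qu"
```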

Graves et al., 2006. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks

This paper's ideas were also used in Baidu's DeepSpeech. Using both the attention model and the CTC cost can help you build an accurate speech recognition system.

Trigger Word Detection

Deep learning has also enabled devices that activate on specific spoken commands, known as trigger word detection systems. Examples include:

  • Amazon Echo ("Alexa")
  • Google Home ("Okay Google")
  • Apple HomePod ("Hey Siri")

The literature on trigger word detection systems is still evolving, but a typical setup involves:

  • $X$: an audio clip
  • $X$ is transformed into a spectrogram before being fed to the model

$Y$ is a sequence of binary labels, where 0 indicates the absence and 1 indicates the presence of the trigger word. The model architecture can look like this:

Trigger Word Detection Model Architecture

The vertical lines in the audio clip mark the moments just after the trigger word has been said; the corresponding outputs at those points should be 1.

One disadvantage of this setup is that the label imbalance in the dataset (predominantly 0s) poses a challenge. This can be mitigated by outputting 1 for several time steps following a trigger word rather than for a single step, which balances the distribution of labels.

Balancing Technique for Trigger Word Detection
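A small NumPy sketch of this balancing trick, with made-up sizes and a hypothetical helper named insert_ones (not a library function):

```python
import numpy as np

Ty = 1375                                    # number of output time steps (illustrative)

def insert_ones(y, end_index, n_ones=50):
    """Label the n_ones steps after the end of a trigger word as 1 instead of just one step."""
    y = y.copy()
    y[end_index + 1 : min(end_index + 1 + n_ones, Ty)] = 1
    return y

y = np.zeros(Ty)
y = insert_ones(y, end_index=700)            # trigger word ends at step 700 (hypothetical)
print(int(y.sum()))                          # 50 positive labels rather than a single 1
```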

Attention isn't the only exciting trend in RNN research. For example, Grid LSTMs by Kalchbrenner, et al. (2015) seem extremely promising. Work using RNNs in generative models - such as Gregor, et al. (2015), Chung, et al. (2015), or Bayer & Osendorfer (2015) - also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to only be more so!