Artificial Intelligence 🤖
Recurrent Neural Networks (RNNs)
The Recurrent Neural Network Model

Why Not Use Standard Networks for Sequence Tasks?

Standard networks face challenges with sequence tasks due to:

  1. Variable Input and Output Lengths: Different training examples can have varying lengths. Padding to maximum lengths is a workaround but not an optimal solution for standard neural networks (NNs).
  2. Feature Sharing Across Positions: Unlike Convolutional Neural Networks (CNNs), standard NNs do not share features learned across different positions in a sequence. Recurrent Neural Networks (RNNs) address this by allowing feature sharing.

RNNs don't have either of the two mentioned problems.

Basic RNN Architecture

Let's build an RNN that solves the named entity recognition task (here we use a unidirectional RNN architecture):

In this example, $T_x = T_y$. In cases where they differ, the RNN architecture would need to be adapted. The initial activation $a^{<0>}$ is typically zero-initialized, although random initialization is sometimes used. The RNN uses three weight matrices:

  1. $W_{ax}$: Shape $(\text{NoOfHiddenNeurons}, n_x)$
  2. $W_{aa}$: Shape $(\text{NoOfHiddenNeurons}, \text{NoOfHiddenNeurons})$. This matrix acts as the memory of the RNN, maintaining information from previous layers.
  3. $W_{ya}$: Shape $(n_y, \text{NoOfHiddenNeurons})$
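
To make these shapes concrete, here is a minimal numpy sketch with illustrative sizes (100 hidden units, $n_x = 10{,}000$ for a one-hot input vocabulary, and a single binary output per word; the variable names are our own):

```python
import numpy as np

n_x = 10_000          # input dimension (e.g., one-hot vocabulary size) -- assumed for illustration
n_a = 100             # number of hidden units
n_y = 1               # output dimension (e.g., one binary tag per word)

rng = np.random.default_rng(0)
W_ax = rng.standard_normal((n_a, n_x)) * 0.01   # input-to-hidden weights
W_aa = rng.standard_normal((n_a, n_a)) * 0.01   # hidden-to-hidden ("memory") weights
W_ya = rng.standard_normal((n_y, n_a)) * 0.01   # hidden-to-output weights
b_a = np.zeros((n_a, 1))
b_y = np.zeros((n_y, 1))

a0 = np.zeros((n_a, 1))                         # zero-initialized initial activation a^<0>
```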

A lot of papers and books write the same architecture this way:

Basic RNN architecture

This form is harder to interpret, so it's easier to unroll these loops.

An unrolled RNN

This chain-like nature reveals that RNNs are intimately related to sequences and lists. There are limitations to unidirectional RNNs, however. For example, the current output $\hat{y}^{<t>}$ depends only on the previous inputs and activations. Let's take this example:

'He Said, "Teddy Roosevelt was a great president"'

In this example, "Teddy" is a person's name. We know that from the word "president" that comes after "Teddy", NOT from "He" or "said" that come before it. So, a limitation of this architecture is that it cannot use information from elements later in the sequence. To address this problem, we will later discuss Bidirectional RNNs (BRNNs).

Forward Propagation

Forward propagation in RNNs is the sequential process where the network processes input data over time steps. Consider the basic RNN architecture again:

Basic RNN architecture

Here, we initially pass in a zero vector as the initial activation $a^{<0>}$. Then, for each time step $t$, we compute the activation $a^{<t>}$ and the output $\hat{y}^{<t>}$. For the first time step, for example:

$$a^{<0>} = \vec{0}$$

$$a^{<1>} = g_1(W_{aa} a^{<0>} + W_{ax} x^{<1>} + b_a)$$

Where $g_1$ is the activation function for the hidden layer, usually $\tanh$ or ReLU. The output $\hat{y}^{<1>}$ is computed as:

$$\hat{y}^{<1>} = g_2(W_{ya} a^{<1>} + b_y)$$

Where $g_2$ is the activation function for the output layer, usually sigmoid or softmax. The same process is repeated for each time step $t$.
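
Here is a minimal numpy sketch of this forward pass, reusing the weight shapes defined earlier; the function names are just illustrative, and we use $\tanh$ for $g_1$ and a sigmoid for $g_2$ as in the binary named entity case:

```python
import numpy as np

def rnn_cell_forward(x_t, a_prev, W_ax, W_aa, W_ya, b_a, b_y):
    """One RNN time step: returns the new activation a^<t> and prediction y_hat^<t>."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)          # hidden activation (g1 = tanh)
    y_hat_t = 1.0 / (1.0 + np.exp(-(W_ya @ a_t + b_y)))      # output (g2 = sigmoid for binary tags)
    return a_t, y_hat_t

def rnn_forward(x_seq, a0, W_ax, W_aa, W_ya, b_a, b_y):
    """Unrolled forward propagation over all T_x time steps."""
    a_t, y_hats = a0, []
    for x_t in x_seq:                                        # x_seq: list of (n_x, 1) column vectors
        a_t, y_hat_t = rnn_cell_forward(x_t, a_t, W_ax, W_aa, W_ya, b_a, b_y)
        y_hats.append(y_hat_t)
    return y_hats
```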

In general, $a^{<t>}$ represents the hidden layer's activation at time step $t$, and $x^{<t>}$ is the input at the same time step. The predicted output $\hat{y}^{<t>}$ and the actual output $y^{<t>}$ are both at time $t$. The computation of $a^{<t>}$ is based on the previous time step's activation $a^{<t-1>}$ and the current input $x^{<t>}$.

Typical activation functions for the hidden layer are $\tanh$ or ReLU. The choice of activation function for the output layer, such as sigmoid or softmax, depends on the specific task. For binary classification tasks like named entity recognition, a sigmoid function is often used.

To develop more intricate RNN architectures, it's helpful to simplify the notation. The generalised simplified RNN equation for the hidden layer activation is:

$$a^{<t>} = g(W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_a)$$

The advantage of this notation is that rather than carrying around two parameter matrices, $W_{aa}$ and $W_{ax}$, we can compress them into just one parameter matrix $W_a$; this way we can simplify our notation for when we develop more complex models.

$$a^{<t>} = g(W_a[a^{<t-1>}, x^{<t>}] + b_a)$$

Here, $W_a$ is a concatenation of $W_{aa}$ and $W_{ax}$ stacked horizontally:

$$W_a = [W_{aa}, W_{ax}]$$

Conversely, $[a^{<t-1>}, x^{<t>}]$ is $a^{<t-1>}$ and $x^{<t>}$ stacked vertically:

$$[a^{<t-1>}, x^{<t>}] = \begin{bmatrix} a^{<t-1>} \\ x^{<t>} \end{bmatrix}$$

For instance, if $a^{<t-1>}$ has a dimensionality of 100 (representing the number of hidden nodes, hence a shape of $(100, 1)$), and $x^{<t>}$ is the input at position $t$ with dimensionality $n_x = 10{,}000$ (e.g., a one-hot vector over a 10,000-word vocabulary, hence a shape of $(10000, 1)$), then $W_{aa}$ would be $(100, 100)$ and $W_{ax}$ would be $(100, 10000)$. Stacking $W_{aa}$ and $W_{ax}$ horizontally gives us $W_a$:

$$W_a = \underbrace{\begin{bmatrix} W_{aa} & W_{ax} \end{bmatrix}}_{(100,\; 100 + 10000)}$$

  • The shape of $W_a$ is $(\text{NoOfHiddenNeurons}, n_x + \text{NoOfHiddenNeurons})$.
  • The shape of $[a^{<t-1>}, x^{<t>}]$ is $(\text{NoOfHiddenNeurons} + n_x, 1)$.
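
We can sanity-check numerically that the compressed notation computes exactly the same thing as the two-matrix form; a small sketch with the shapes used above:

```python
import numpy as np

rng = np.random.default_rng(1)
n_a, n_x = 100, 10_000
W_aa = rng.standard_normal((n_a, n_a))
W_ax = rng.standard_normal((n_a, n_x))
a_prev = rng.standard_normal((n_a, 1))
x_t = rng.standard_normal((n_x, 1))

W_a = np.hstack([W_aa, W_ax])          # shape (100, 10100): horizontal stack
ax = np.vstack([a_prev, x_t])          # shape (10100, 1): vertical stack

# W_a [a^<t-1>, x^<t>] reproduces W_aa a^<t-1> + W_ax x^<t>
assert np.allclose(W_a @ ax, W_aa @ a_prev + W_ax @ x_t)
```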

Backpropagation Through Time (BPTT)

Backpropagation in RNNs, also known as Backpropagation Through Time (BPTT), is usually done by DL frameworks automatically for you, but it's useful to know how it works in RNNs. Here is the graph:

RNN Backpropagation Illustration

In this architecture, weights such as $W_a$, $b_a$, $W_y$, and $b_y$ are shared across each time step in a sequence. For our loss function, we employ the cross-entropy loss, defining it first for an individual time step as:

$$\mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>}) = - y^{<t>} \log \hat{y}^{<t>} - (1 - y^{<t>}) \log (1 - \hat{y}^{<t>})$$

The total loss for the sequence is the summation of individual losses:

$$\mathcal{L}(\hat{y}, y) = \sum_{t=1}^{T_y} \mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>})$$

With the losses added in:

RNN with Losses

BPTT involves passing the activations and gradients backward through the network, conceptually moving in reverse through time.
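
As a sketch, the total loss can be accumulated over the outputs of a forward pass like the one above (binary labels, as in named entity recognition):

```python
import numpy as np

def sequence_loss(y_hats, ys, eps=1e-12):
    """Cross-entropy loss summed over all time steps (binary labels y_t in {0, 1})."""
    total = 0.0
    for y_hat_t, y_t in zip(y_hats, ys):      # one term L^<t> per time step
        total += -(y_t * np.log(y_hat_t + eps)
                   + (1 - y_t) * np.log(1 - y_hat_t + eps))
    return float(np.squeeze(total))
```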

Different Types of RNN Architectures

We've previously explored an RNN architecture where the number of inputs $T_x$ equals the number of outputs $T_y$. However, varying applications necessitate different RNN structures. The following types are inspired by Andrej Karpathy's blog post:

  • One to Many: Suitable for tasks like music generation, where a single input generates a sequence.
  • Many to One: Used in applications like sentiment analysis, where $X$ is a text and $Y$ is an integer that ranges from 1 to 5, for example.
  • Many to Many: This type applies when $T_x = T_y$, such as in named entity recognition.

Various RNN Architectures

A One to Many architecture application would be music generation.

Note that, starting from the second time step, we feed the generated output back into the network as the next input.

In Many to Many tasks like machine translation (or even gene translation, where $T_x$ and $T_y$ don't have to be the same), where input and output lengths may differ, we utilize an encoder-decoder structure:

Encoder-Decoder RNN Architecture

Here, the encoder condenses the input sequence into a context vector that the decoder uses to produce the output sequence.
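
Here is a high-level sketch of that idea, assuming hypothetical enc_step and dec_step cell functions; real systems also handle <EOS> tokens and use search strategies such as beam search:

```python
import numpy as np

def encoder_decoder_forward(x_seq, T_y, enc_step, dec_step, n_a, vocab_size):
    """Sketch: the encoder compresses the input sequence into a context vector,
    then the decoder generates T_y outputs starting from that context."""
    a = np.zeros((n_a, 1))
    for x_t in x_seq:                     # encoder: read the whole input sequence
        a = enc_step(x_t, a)
    context = a                           # context vector summarizing the input

    outputs, y_prev = [], np.zeros((vocab_size, 1))
    a = context
    for _ in range(T_y):                  # decoder: generate the output sequence
        y_hat, a = dec_step(y_prev, a)
        outputs.append(y_hat)
        y_prev = y_hat                    # feed the prediction back in as the next input
    return outputs
```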

Summary of RNN Types

More complex architectures, like the attention mechanism, allow the model to focus on specific parts of the input sequence when generating each word in the output sequence.

Language Models and Sequence Generation

Language models assign probabilities to sequences of words, aiding in tasks like speech recognition. For example, distinguishing between phonetically similar phrases:

  1. "The apple and pair salad."
  2. "The apple and pear salad."

A language model assesses the likelihood of each sequence to guide the speech recognition system.

Building Language Models with RNNs

Creating a language model with RNNs involves:

  1. Gathering a corpus (NLP terminology for a large set) of text in the target language.
  2. Tokenizing the corpus to create a vocabulary, including tokens for end of sentence (<EOS>) and unknown words (<UNK>). Each word is then one-hot encoded.
  3. Training the RNN to predict the next word in a sequence, which gives us the probability of each word given the previous sequence:
$$P(y^{<t>} \mid y^{<1>}, y^{<2>}, \ldots, y^{<t-1>})$$
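
A minimal sketch of steps 1 and 2 (tokenization, special tokens, and one-hot encoding); the toy corpus and variable names are ours:

```python
import numpy as np

corpus = ["cats average 15 hours of sleep a day"]          # toy corpus for illustration

# Build the vocabulary, with special tokens for end-of-sentence and unknown words
words = sorted({w for sentence in corpus for w in sentence.split()})
vocab = words + ["<EOS>", "<UNK>"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """One-hot encode a word; unseen words map to <UNK>."""
    vec = np.zeros((len(vocab), 1))
    vec[word_to_idx.get(word, word_to_idx["<UNK>"])] = 1.0
    return vec

x = one_hot("cats")        # shape (len(vocab), 1)
```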

Given the sentence "Cats average 15 hours of sleep a day.<EOS>" during training for example:

At each time step, this computes the probability of the next word in the dictionary given the previous words; at the $t$-th time step, the RNN is estimating the probability given above.

The RNN's loss function is defined by cross-entropy loss:

$$\mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>}) = - \sum_{i} y_i^{<t>} \log \hat{y}_i^{<t>}$$

Summed over all time steps:

$$\mathcal{L} = \sum_{t} \mathcal{L}^{<t>}(\hat{y}^{<t>}, y^{<t>})$$

Where $i$ ranges over the words in the vocabulary and $t$ over all time steps. Using the trained model, we can:

  • Predict the next word by feeding a sequence into the RNN, taking the final output $\hat{y}^{<t>}$ (a probability distribution over the vocabulary), and sorting by probability.
  • Determine the probability of a sentence by multiplying the probabilities obtained from feeding the sentence into the RNN:
$$P(y^{<1>}, y^{<2>}, y^{<3>}) = P(y^{<1>}) \cdot P(y^{<2>} \mid y^{<1>}) \cdot P(y^{<3>} \mid y^{<1>}, y^{<2>})$$

This is simply feeding the sentence into the RNN and multiplying the output probabilities.
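
A sketch of scoring a sentence this way, assuming a hypothetical model_step(x_t, a_prev) function that returns the softmax distribution over the vocabulary and the updated hidden state:

```python
import numpy as np

def sentence_probability(sentence_indices, model_step, vocab_size, n_a):
    """Multiply the per-step probabilities P(y^<t> | y^<1>, ..., y^<t-1>)."""
    a_prev = np.zeros((n_a, 1))
    x_t = np.zeros((vocab_size, 1))             # x^<1> = 0 vector
    prob = 1.0
    for idx in sentence_indices:                # word indices of the sentence (plus <EOS>)
        y_hat, a_prev = model_step(x_t, a_prev) # hypothetical trained-model step -> softmax, a^<t>
        prob *= float(y_hat[idx])               # probability assigned to the actual next word
        x_t = np.zeros((vocab_size, 1))
        x_t[idx] = 1.0                          # feed the true word in as the next input
    return prob
```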

Sampling Novel Sequences from a Trained Language Model

After training a language model on a text corpus, we can evaluate what the model has learned by sampling and generating new text sequences. Here's a step-by-step guide on how to sample a novel sequence from a trained language model:

  1. Initialization: Start with the initial hidden state $a^{<0>}$ set to a vector of zeros, and set the initial input $x^{<1>}$ also to a vector of zeros.
  2. First Word Sampling: From the output distribution $\hat{y}^{<1>}$ produced by the model, randomly select the first word. The numpy.random.choice() function can help with this process.
    • This is the step where you get a random beginning of the sentence each time you run the sampling procedure.
    • The random sampling creates diverse starting points for generated sequences.
    • This randomness introduces variability, ensuring that each sampling run can start with a different word, potentially leading to different sentence constructions.
  3. Subsequent Sampling: For each subsequent step, pass the previously predicted word and the updated hidden state $a^{<t>}$ back into the model to generate the next word.
  4. Continue the Process: Repeat step 3 for a predetermined number of iterations or until the model outputs an end-of-sentence (<EOS>) token.
  5. Handling Unknown Tokens: If desired, any unknown (<UNK>) tokens that appear in the output can be discarded or replaced to ensure the coherence of the generated text.

By following these steps, the model can produce a sequence that reflects the patterns and structure it learned during training.
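
Putting the steps together, a sketch of the sampling loop might look like this (again assuming a hypothetical model_step function returning a softmax distribution and the updated hidden state):

```python
import numpy as np

def sample_sequence(model_step, vocab, n_a, max_len=50, eos_token="<EOS>"):
    """Sample a novel word sequence from a trained language model (sketch)."""
    vocab_size = len(vocab)
    a_prev = np.zeros((n_a, 1))             # a^<0> = 0
    x_t = np.zeros((vocab_size, 1))         # x^<1> = 0
    sampled = []
    for _ in range(max_len):
        y_hat, a_prev = model_step(x_t, a_prev)   # softmax over the vocabulary
        # Randomly pick the next word according to the predicted distribution
        idx = np.random.choice(vocab_size, p=y_hat.ravel())
        word = vocab[idx]
        if word == eos_token:               # stop at the end-of-sentence token
            break
        sampled.append(word)
        x_t = np.zeros((vocab_size, 1))     # feed the sampled word back in as the next input
        x_t[idx] = 1.0
    return sampled
```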

Word-Level vs. Character-Level Language Models

There are two primary types of language models: word-level and character-level. Each has its own set of advantages and disadvantages. In a character-level language model, the vocabulary contains [a-zA-Z0-9], punctuation, special characters, and possibly special tokens.
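
For instance, a character-level vocabulary might be built like this (a sketch; the exact character set and tokens are a design choice):

```python
import string

# Character-level vocabulary: letters, digits, punctuation, plus space and newline
chars = list(string.ascii_letters + string.digits + string.punctuation + " \n")
char_to_idx = {c: i for i, c in enumerate(chars)}

# A word is represented as a sequence of character indices rather than a single token
word = "Teddy"
encoded = [char_to_idx[c] for c in word]
```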

Pros of Character-Level Models:

  • Capable of generating any word, even ones not seen during training, since the model learns from sequences of characters.
  • There is no need for an <UNK> token.

Cons of Character-Level Models:

  • Tend to produce longer sequences, which can be more challenging to manage and can increase computational costs.
  • Often less effective at capturing long-range dependencies compared to word-level models because the sequences are longer and more complex.
  • Computationally more expensive and typically more challenging to train due to the increased complexity of the sequences.

Despite these cons, character-level language models can be particularly useful in scenarios where out-of-vocabulary words are common, or in specialized domains with unique vocabularies. As computational resources continue to improve, character-level models are becoming more feasible for a broader range of applications. However, word-level models remain popular due to their efficiency and effectiveness at capturing linguistic patterns.

Vanishing Gradients and Long-Term Dependencies in RNNs

Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in "the clouds are in the sky," we don't need any further context - it's pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it's needed is small, RNNs can learn to use the past information.

The Challenge of Long-Term Dependencies

One of the problems with naive RNNs, however, is that they run into the vanishing gradient problem. An RNN that processes a sequence of 10,000 time steps is effectively a network 10,000 layers deep, which is very hard to optimize. Let's take an example. Consider a scenario where we need to predict the next word in a sentence:

  1. "The cat, which already ate ..., was full."
  2. "The cats, which already ate ..., were full."

The necessary information ("cat" or "cats") may be separated from the point of use ("was" or "were") by a long string of irrelevant words. RNNs struggle to connect such distant information due to the vanishing gradient problem (similar to Deep Neural Networks), where the gradient signal gets smaller and smaller as it backpropagates through time. Consequently, updates to the weights become negligible, making it difficult for the RNN to learn these dependencies.

Vanishing Gradient Illustration

These gradients vanish and therefore fail to update the earlier weights efficiently. To compute the gradient for the word "was", we need to backpropagate through everything that comes before it. Multiplying many numbers smaller than one tends to make the gradient vanish, while multiplying many large numbers tends to make it explode. Therefore, some of your weights may not be updated properly, especially when they depend on values far downstream.

In the problem we described, this means it's hard for the network to learn that the word "was" relates all the way back to "cat". So, in this case, the network won't identify the singular/plural cue that determines the correct verb form, was/were.

The conclusion is that RNNs aren't good at long-term dependencies. In theory, RNNs are absolutely capable of handling such long-term dependencies. A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don't seem to be able to learn them.

Vanishing vs. Exploding Gradients

  • Vanishing Gradients: Multiplying gradients through the network layers during backpropagation can lead to very small gradients. This means that weights in the earlier layers of the RNN do not learn effectively.
  • Exploding Gradients: Conversely, if the gradients are large, they can explode, leading to very large updates to the weights and potentially resulting in numerical instability (e.g., NaN values).

While exploding gradients are easy to detect by the appearance of NaNs in the weight updates, vanishing gradients tend to be the bigger problem for RNNs.
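
The effect is easy to see numerically: repeatedly multiplying by factors smaller than one drives a value toward zero, while factors larger than one blow it up (a toy illustration, not an actual RNN gradient computation):

```python
import numpy as np

T = 100                                   # number of time steps the gradient travels back through
print(np.prod(np.full(T, 0.9)))           # ~2.7e-5  -> vanishing
print(np.prod(np.full(T, 1.1)))           # ~1.4e+4  -> exploding
```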

Solutions for Exploding and Vanishing Gradients

Solutions for the exploding gradient problem:

  • Gradient Clipping: This technique involves scaling down gradients when they exceed a certain threshold, preventing them from becoming too large (see the sketch after this list).

    Gradient Clipping

  • Truncated Backpropagation: Instead of backpropagating through the entire sequence, we can limit the backpropagation to a fixed number of steps. This approach won't solve the vanishing gradient problem but can help manage exploding gradients. It's also not optimal because you won't update all the weights.
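
As a sketch, clipping by the global gradient norm can be done like this (the threshold value is arbitrary):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale all gradients if their global L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        grads = {name: g * scale for name, g in grads.items()}
    return grads
```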

Solutions for the vanishing gradient problem:

  • Weight Initialization: Using specific initialization strategies like He initialization can help in mitigating vanishing gradients by starting the training process with appropriately scaled weights.
  • Echo State Networks: A type of RNN where the hidden-to-hidden weights are not trained. They rely on a fixed, random sparsity pattern, which can sometimes help with the vanishing gradient problem.
  • LSTM/GRU Networks: Long Short-Term Memory (LSTM) units and Gated Recurrent Units (GRUs) are designed to overcome the vanishing gradient problem by incorporating mechanisms that allow for selective remembering and forgetting of information.