
Advanced RNN Architectures

The vanishing gradient problem plagues simpler RNNs. Alternative architectures such as LSTMs and GRUs address it with gating mechanisms. In this section, we will discuss these architectures in detail.

Gated Recurrent Unit (GRU)

Gated Recurrent Units (GRUs) are a type of RNN architecture that incorporate gating mechanisms to control and manage the flow of information. These gates determine what information should be passed along to the output and what should continue to the next step of the sequence.

Basic RNN vs GRU

The basic RNN has a simple structure where the hidden state at time $t$, denoted as $a^{<t>}$, is calculated based on the previous hidden state and the current input.

Basic RNN Unit

However, GRUs improve upon this by introducing a memory cell $c$, which helps the network better capture long-range dependencies and address vanishing gradient issues. It tells us whether to memorize something or not.

GRU Mechanism

The GRU updates its memory cell using two gates:

  1. Update Gate ($\Gamma_u$): Determines how much of the past information needs to be passed along to the future.
  2. Relevance Gate ($\Gamma_r$): In the full GRU, it decides how relevant the previous memory content $c^{<t-1>}$ is for computing the current candidate $\tilde{c}^{<t>}$.

The candidate memory $\tilde{c}^{<t>}$ is computed as a combination of the current input and the previous memory cell $c^{<t-1>}$, modulated by the relevance gate.

Simplified GRU Equations

In GRUs, the memory cell $c^{<t>}$ and the hidden state $a^{<t>}$ are the same. The equations for the simplified GRU are as follows:

  1. Candidate Memory: This calculates $\tilde{c}^{<t>}$, which is a candidate for replacing $c^{<t>}$:

    $\tilde{c}^{<t>} = \tanh(W_c[c^{<t-1>}, x^{<t>}] + b_c)$
  2. Update Gate: The $\sigma$ function outputs a value in $[0,1]$:

    $\Gamma_u = \sigma(W_u[c^{<t-1>}, x^{<t>}] + b_u)$
  3. Memory Cell Update:

    $c^{<t>} = \Gamma_u \ast \tilde{c}^{<t>} + (1 - \Gamma_u) \ast c^{<t-1>}$

Where $\ast$ denotes element-wise multiplication.
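
To make these equations concrete, here is a minimal NumPy sketch of a single simplified-GRU step. The function name `simplified_gru_step` and the toy dimensions are illustrative, not part of any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simplified_gru_step(c_prev, x_t, Wc, bc, Wu, bu):
    """One step of the simplified GRU described above.

    c_prev: previous memory cell, shape (n_hidden, 1)
    x_t:    current input,        shape (n_input, 1)
    Wc, Wu: weight matrices,      shape (n_hidden, n_hidden + n_input)
    bc, bu: biases,               shape (n_hidden, 1)
    """
    concat = np.vstack([c_prev, x_t])                  # [c^<t-1>, x^<t>]
    c_tilde = np.tanh(Wc @ concat + bc)                # candidate memory
    gamma_u = sigmoid(Wu @ concat + bu)                # update gate in (0, 1)
    c_t = gamma_u * c_tilde + (1 - gamma_u) * c_prev   # element-wise blend
    return c_t                                         # also the hidden state a^<t>

# Toy usage with random parameters (hypothetical sizes).
n_hidden, n_input = 4, 3
rng = np.random.default_rng(0)
Wc = rng.standard_normal((n_hidden, n_hidden + n_input)) * 0.1
Wu = rng.standard_normal((n_hidden, n_hidden + n_input)) * 0.1
bc = np.zeros((n_hidden, 1))
bu = np.zeros((n_hidden, 1))
c = np.zeros((n_hidden, 1))
x = rng.standard_normal((n_input, 1))
c = simplified_gru_step(c, x, Wc, bc, Wu, bu)
print(c.shape)  # (4, 1)
```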

Simplified GRU Intuition

Because of the sigmoid, $\Gamma_u$ ends up close to either 0 or 1 for most of its input range. For intuition, let's think of it as a binary value for now: depending on the update gate, the memory cell at each step either takes the new candidate value or keeps the previous one.

Using the example of matching subjects with verbs in sentences, the GRU can effectively remember the number (singular or plural) of the subject to use the correct form of the verb, even if there are intervening words that might distract a simpler model. Take the cat sentence example:

"The cat, which already ate ..., was full"

We will suppose that $\Gamma_u$ is 0 or 1 and is a bit that tells us whether the singular subject needs to be memorized. Splitting the sentence into words and writing the values of $c$ and $\Gamma_u$ at each position, we have:

| Word    | Update Gate ($\Gamma_u$)    | Cell Memory ($c$) |
|---------|-----------------------------|-------------------|
| The     | 0                           | val               |
| cat     | 1                           | new_val           |
| which   | 0                           | new_val           |
| already | 0                           | new_val           |
| ...     | 0                           | new_val           |
| was     | 1 (I don't need it anymore) | newer_val         |
| full    | ...                         | ...               |

Schematically:

This drawing is similar to the one in this popular blog post and makes it easier to understand GRUs and LSTMs, but looking at the equations helps a lot.

Because the update gate $\Gamma_u$ is often a very small number like 0.00001, $c^{<t>} \approx c^{<t-1>}$ in a lot of cases, so the value stored in the memory cell is maintained across many time steps; this is why GRUs don't suffer from the vanishing gradient problem. And since $c^{<t>}$ is a weighted average of $\tilde{c}^{<t>}$ and $c^{<t-1>}$, the cell values stay bounded, which also helps the GRU avoid the exploding gradient problem.

In terms of matrix shapes:

  • $a^{<t>}$ has shape $(\text{NoOfHiddenNeurons}, 1)$
  • $c^{<t>}$ has the same shape as $a^{<t>}$
  • $\tilde{c}^{<t>}$ has the same shape as $a^{<t>}$
  • $\Gamma_u$ also has the same dimensions as $a^{<t>}$

Here, if $\Gamma_u$ is a 100-dimensional vector, it is really a 100-dimensional vector of bits whose values are mostly 0 or 1, telling you which of the 100 dimensions of the memory cell you want to update.

What has been described so far is the simplified GRU unit. Let's now describe the full one.

Full GRU Equations

The full GRU contains an additional relevance gate $\Gamma_r$ that is used to calculate the candidate $\tilde{c}^{<t>}$. It tells you how relevant $c^{<t-1>}$ is to computing the next candidate for $c^{<t>}$. The gate $\Gamma_r$ is computed much as you would expect, with a new parameter matrix $W_r$ applied to $[c^{<t-1>}, x^{<t>}]$ plus a bias $b_r$.

The equations for the full GRU are as follows:

  1. Candidate Memory: This calculates $\tilde{c}^{<t>}$, which is a candidate for replacing $c^{<t>}$, using the relevance gate $\Gamma_r$:

    $\tilde{c}^{<t>} = \tanh(W_c[\Gamma_r \ast c^{<t-1>}, x^{<t>}] + b_c)$
  2. Relevance Gate:

    $\Gamma_r = \sigma(W_r[c^{<t-1>}, x^{<t>}] + b_r)$
  3. Update Gate:

    $\Gamma_u = \sigma(W_u[c^{<t-1>}, x^{<t>}] + b_u)$
  4. Memory Cell Update:

    $c^{<t>} = \Gamma_u \ast \tilde{c}^{<t>} + (1 - \Gamma_u) \ast c^{<t-1>}$
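
Following the same style as before, here is a minimal sketch of one full-GRU step with the relevance gate included; the function name `full_gru_step` and the way the parameters are packed are just for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def full_gru_step(c_prev, x_t, params):
    """One step of the full GRU: relevance gate + update gate."""
    Wc, bc, Wu, bu, Wr, br = params
    concat = np.vstack([c_prev, x_t])                  # [c^<t-1>, x^<t>]
    gamma_r = sigmoid(Wr @ concat + br)                # relevance gate
    gamma_u = sigmoid(Wu @ concat + bu)                # update gate
    gated = np.vstack([gamma_r * c_prev, x_t])         # [Γ_r * c^<t-1>, x^<t>]
    c_tilde = np.tanh(Wc @ gated + bc)                 # candidate memory
    return gamma_u * c_tilde + (1 - gamma_u) * c_prev  # new memory / hidden state
```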

GRUs have been empirically found to be effective across a range of tasks. While there are many possible variations and architectural choices, GRUs strike a balance between complexity and capability, making them a robust choice for sequence modeling. They can be considered a middle ground between the simplicity of vanilla RNNs and the complexity of LSTMs.

Researchers have experimented extensively to arrive at the GRU design, which empirically performs well across many sequence modeling tasks. While there is still no definitive theoretical explanation for why certain architectures work better than others, the evidence from practical applications is compelling. GRUs and LSTMs are the industry standard.

Long Short-Term Memory (LSTM)

Long Short-Term Memory networks, or LSTMs, are a special kind of RNN that are capable of learning long-term dependencies. They work exceptionally well on a wide range of sequence learning problems, thanks to their design, which includes multiple gates that regulate the flow of information.

LSTMs are more powerful and general than the GRUs we looked at. In an LSTM, $c^{<t>} \neq a^{<t>}$, and it includes three gates to modulate the flow of information through the cell. GRUs, on the other hand, have two gates and combine the cell state and hidden state into one.

In GRUs we had:

  • $c^{<t>} = a^{<t>}$
  • An update gate $\Gamma_u$
  • A relevance gate $\Gamma_r$
  • A candidate cell variable $\tilde{c}^{<t>}$

Whereas in the LSTM, we now have:

  • $c^{<t>} \neq a^{<t>}$
  • Update Gate ($\Gamma_u$): Determines what information should be updated.
  • Forget Gate ($\Gamma_f$): Controls what information should be discarded from the cell state.
  • Output Gate ($\Gamma_o$): Decides what parts of the cell state the next hidden state $a^{<t>}$ should output.
  • A candidate cell variable $\tilde{c}^{<t>}$

As a recap, RNNs have the form of a chain of repeating modules. In standard RNNs, this repeating module will have a very simple structure, such as a single $\tanh$ layer:


LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

An LSTM neural network.

Where:

LSTM Notation

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied, with the copies going to different locations.

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram. The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It's very easy for information to just flow along it unchanged.

LSTM cell state

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by the gates. Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.

Pointwise multiplication operation

The overall LSTM diagram is as follows, but we will go through it step by step.

LSTM Diagram

Overall, we will use $a^{<t-1>}$ and $x^{<t>}$ to compute four values (three gates and a candidate), which are then combined to obtain the new $c^{<t>}$ from the previous $c^{<t-1>}$, as well as the output $a^{<t>}$.

LSTM Step by Step

The first step in our LSTM is to decide what information we're going to throw away from the cell state. This decision is made by a sigmoid layer called the forget gate layer ($\Gamma_f$). It looks at $a^{<t-1>}$ and $x^{<t>}$, and outputs a number between 0 and 1 for each number in the cell state $c^{<t-1>}$. A 1 represents "completely keep this" while a 0 represents "completely get rid of this."

Let's go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.

Forget gate: $\Gamma_f = \sigma(W_f[a^{<t-1>}, x^{<t>}] + b_f)$

The next step is to decide what new information we're going to store in the cell state. This has two parts. First, a sigmoid layer called the update/input gate layer ($\Gamma_u$) decides which values we'll update. Next, a $\tanh$ layer creates a vector of new candidate values, $\tilde{c}^{<t>}$, that could be added to the state. In the next step, we'll combine these two to create an update to the state.

In the example of our language model, we'd want to add the gender of the new subject to the cell state, to replace the old one we're forgetting.

Update/Input gate: $\Gamma_u = \sigma(W_u[a^{<t-1>}, x^{<t>}] + b_u)$
Candidate: $\tilde{c}^{<t>} = \tanh(W_c[a^{<t-1>}, x^{<t>}] + b_c)$

It's now time to update the old cell state, $c^{<t-1>}$, into the new cell state $c^{<t>}$. We multiply the old state by $\Gamma_f$, forgetting the things we decided to forget earlier. Then we add $\Gamma_u \ast \tilde{c}^{<t>}$: the new candidate values, scaled by how much we decided to update each state value. In the case of the language model, this is where we'd actually drop the information about the old subject's gender and add the new information, as we decided in the previous steps.

$c^{<t>} = \Gamma_u \ast \tilde{c}^{<t>} + \Gamma_f \ast c^{<t-1>}$

Finally, we need to decide what we're going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we're going to output. Then, we put the cell state through $\tanh$ (to push the values to be between $-1$ and $1$) and multiply it by the output of the sigmoid gate $\Gamma_o$, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that's what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that's what follows next.

Output gate: $\Gamma_o = \sigma(W_o[a^{<t-1>}, x^{<t>}] + b_o)$
Hidden state: $a^{<t>} = \Gamma_o \ast \tanh(c^{<t>})$
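
Putting the four equations together, a single LSTM step can be sketched in NumPy as follows; the function name `lstm_step` and the way the parameters are packed into a tuple are assumptions for the example, not a reference implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, params):
    """One LSTM step following the equations above.

    Returns the new hidden state a^<t> and cell state c^<t>.
    """
    Wf, bf, Wu, bu, Wc, bc, Wo, bo = params
    concat = np.vstack([a_prev, x_t])           # [a^<t-1>, x^<t>]
    gamma_f = sigmoid(Wf @ concat + bf)         # forget gate
    gamma_u = sigmoid(Wu @ concat + bu)         # update/input gate
    c_tilde = np.tanh(Wc @ concat + bc)         # candidate cell state
    c_t = gamma_u * c_tilde + gamma_f * c_prev  # cell state update
    gamma_o = sigmoid(Wo @ concat + bo)         # output gate
    a_t = gamma_o * np.tanh(c_t)                # filtered hidden state
    return a_t, c_t
```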

Here are the equations of an LSTM unit alongside those we saw for the GRU:

| LSTM | GRU |
|------|-----|
| Candidate cell state: $\tilde{c}^{<t>} = \tanh(W_c[a^{<t-1>}, x^{<t>}] + b_c)$ | Candidate hidden state: $\tilde{c}^{<t>} = \tanh(W_c[\Gamma_r \ast a^{<t-1>}, x^{<t>}] + b_c)$ |
| Update gate: $\Gamma_u = \sigma(W_u[a^{<t-1>}, x^{<t>}] + b_u)$ | Update gate: $\Gamma_u = \sigma(W_u[a^{<t-1>}, x^{<t>}] + b_u)$ |
| Forget gate: $\Gamma_f = \sigma(W_f[a^{<t-1>}, x^{<t>}] + b_f)$ | Relevance gate: $\Gamma_r = \sigma(W_r[a^{<t-1>}, x^{<t>}] + b_r)$ |
| Output gate: $\Gamma_o = \sigma(W_o[a^{<t-1>}, x^{<t>}] + b_o)$ | Not applicable in GRU |
| Cell state update: $c^{<t>} = \Gamma_u \ast \tilde{c}^{<t>} + \Gamma_f \ast c^{<t-1>}$ | Hidden state update: $a^{<t>} = \Gamma_u \ast \tilde{c}^{<t>} + (1 - \Gamma_u) \ast a^{<t-1>}$ |
| Hidden state output: $a^{<t>} = \Gamma_o \ast \tanh(c^{<t>})$ | Hidden state output: $a^{<t>} = c^{<t>}$ (since $c^{<t>} = a^{<t>}$ in GRU) |

LSTM Variants

While the standard LSTM architecture is powerful, there are variants such as the LSTM with peephole connections introduced by Gers & Schmidhuber (2000), which allow the gate layers to look at the cell state directly, potentially improving the ability of the gates to make more precise decisions. This is the normal LSTM with $c^{<t-1>}$ included with every gate.

LSTM with peephole connections
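
As a rough illustration of the peephole idea, the sketch below follows one common formulation in which the forget and update gates peek at $c^{<t-1>}$ and the output gate peeks at $c^{<t>}$, using element-wise peephole weights `pf`, `pu`, `po`; these names and choices are illustrative assumptions, not the only variant.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_lstm_step(a_prev, c_prev, x_t, params):
    """LSTM step where each gate also 'peeks' at the cell state."""
    Wf, bf, pf, Wu, bu, pu, Wc, bc, Wo, bo, po = params
    concat = np.vstack([a_prev, x_t])
    # Forget and update gates peek at the previous cell state c^<t-1>.
    gamma_f = sigmoid(Wf @ concat + pf * c_prev + bf)
    gamma_u = sigmoid(Wu @ concat + pu * c_prev + bu)
    c_tilde = np.tanh(Wc @ concat + bc)
    c_t = gamma_u * c_tilde + gamma_f * c_prev
    # The output gate peeks at the new cell state c^<t>.
    gamma_o = sigmoid(Wo @ concat + po * c_t + bo)
    a_t = gamma_o * np.tanh(c_t)
    return a_t, c_t
```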

There isn't a universally superior choice among the LSTM, its variants, and the GRU. One advantage of the GRU is that it's simpler and can be used to build much bigger networks, but the LSTM is more powerful and general.

Understanding LSTM Functionality

The sigmoid function in the gates of an LSTM outputs values between 0 and 1, dictating how much of each component should be allowed to pass through. The LSTM's design with three gates ensures a fine-grained approach to managing the flow of information, protecting the cell state, and thereby effectively addressing the vanishing gradient problem that plagues simpler RNNs.

Bidirectional Recurrent Neural Networks (BiRNNs)

Bidirectional Recurrent Neural Networks, or BiRNNs, enhance the ability of sequence models by incorporating information from both past and future time steps. The key idea is that the output at each time step of a BiRNN is determined by two hidden states, one from the forward direction and one from the backward direction.

How BiRNNs Work

BiRNNs are particularly useful in tasks where the context in both directions is necessary to make an accurate prediction. For instance, in named entity recognition, the word following a name can provide valuable context that helps to identify it as a name. Take these sentences:

"He said "Teddy Roosevelt was a great President""

"He said "Teddy bears are on sale!""

Whether Teddy is part of a person's name cannot be determined from the preceding words He and said, but it can be determined from the word that follows (Roosevelt vs. bears). BiRNNs fix this issue. Here's a graphical representation of a BiRNN:

BiRNN Architecture

BiRNNs are acyclic graphs. The blocks here can be any RNN block including the basic RNNs, LSTMs, or GRUs. In this diagram, information flows in two directions:

  • Forward Pass: Processes the sequence from the first element to the last (left to right in the diagram).
  • Backward Pass: Processes the sequence from the last element to the first (right to left in the diagram).

The final prediction at each time step, $\hat{y}^{<t>}$, uses the concatenated hidden states from both directions. This dual-direction processing allows the network to capture patterns that may be invisible in a unidirectional context.
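
A minimal sketch of this idea, using basic $\tanh$ RNN blocks for brevity (the same structure applies with GRU or LSTM blocks); the helper names `rnn_pass` and `birnn` are made up for the example.

```python
import numpy as np

def rnn_pass(xs, Wa, Wx, ba):
    """Run a basic tanh RNN over a list of inputs, returning all hidden states."""
    a = np.zeros_like(ba)
    states = []
    for x in xs:
        a = np.tanh(Wa @ a + Wx @ x + ba)
        states.append(a)
    return states

def birnn(xs, fwd_params, bwd_params):
    """Concatenate forward-pass and backward-pass hidden states per time step."""
    fwd = rnn_pass(xs, *fwd_params)              # left to right
    bwd = rnn_pass(xs[::-1], *bwd_params)[::-1]  # right to left, re-aligned in time
    # The prediction y-hat at each step would be computed from this concatenation.
    return [np.vstack([f, b]) for f, b in zip(fwd, bwd)]
```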

Applications and Limitations

BiRNNs are commonly used in natural language processing (NLP) tasks, such as machine translation and text classification, where understanding the full context of a sequence is crucial. However, BiRNNs require that the entire input sequence be available before processing, which makes them unsuitable for real-time applications like live speech recognition.

For a lot of NLP and text processing problems, a BiRNN with LSTM blocks is commonly used.

Deep Recurrent Neural Networks

While a standard RNN has a single recurrent layer, a deep RNN consists of multiple RNN layers stacked on top of each other, adding levels of abstraction.

Structure of Deep RNNs

Here is an illustration of a deep RNN with three layers:

Deep RNN Structure

Each layer's hidden state is passed to both the next layer at the same time step and the same layer at the next time step. This structure allows the network to learn more complex representations, but it also significantly increases the computational complexity.
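
A minimal sketch of this stacking, again with basic $\tanh$ RNN blocks for brevity; `layer_params` holds one (hypothetical) weight set per layer.

```python
import numpy as np

def deep_rnn(xs, layer_params):
    """Stack of basic RNN layers: layer l at time t sees layer l-1's output
    at time t and its own hidden state from time t-1."""
    inputs = xs
    for Wa, Wx, ba in layer_params:            # one parameter set per layer
        a = np.zeros_like(ba)
        outputs = []
        for x in inputs:
            a = np.tanh(Wa @ a + Wx @ x + ba)  # recurrence within this layer
            outputs.append(a)
        inputs = outputs                       # feed this layer's states upward
    return inputs                              # top layer's hidden states
```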

Training and Use Cases

Deep RNNs can be challenging to train due to the increased depth, which can exacerbate issues like vanishing or exploding gradients. However, for complex sequence modeling tasks where additional layers of abstraction are beneficial, deep RNNs can outperform their shallower counterparts.

In practice, the depth of RNNs is usually much more modest than that of feed-forward deep networks (which can have 100 or even 200 layers); even a network with three recurrent layers is considered quite deep. Occasionally, additional feed-forward layers are included within or after the recurrent layers to further process the information before the final output.

