Introduction
Sequence Model Motivation
Sequence models, such as RNNs and LSTMs, have revolutionized learning from sequence data. Applications of sequence data include:
- Speech Recognition (Sequence to Sequence):
  - X: wave sequence
  - Y: text sequence
- Music Generation (One to Sequence):
  - X: nothing or an integer
  - Y: wave sequence
- Sentiment Classification (Sequence to One):
  - X: text sequence
  - Y: integer rating (1 to 5)
- DNA Sequence Analysis (Sequence to Sequence):
  - X: DNA sequence
  - Y: DNA labels
- Machine Translation (Sequence to Sequence):
  - X: text sequence (in one language)
  - Y: text sequence (in another language)
- Video Activity Recognition (Sequence to One):
  - X: video frames
  - Y: activity label
- Named Entity Recognition (Sequence to Sequence):
  - X: text sequence
  - Y: label sequence
  - Useful for search engines to index different types of words within a text.

Each of these problems, with varying input and output formats, can be approached as supervised learning with labeled data as the training set. X and Y can have different lengths, and sometimes only one of them is a sequence.
Notation
We will adopt the following notations throughout this section, taking Name Entity Recognition as our motivating example:
- x: "Harry Potter and Hermione Granger invented a new spell."
- y: 1 1 0 1 1 0 0 0 0
  - Both sequences have a length of 9. 1 indicates a name, while 0 indicates otherwise.
- x^{(i)<t>}: The t-th element in the input sequence of the i-th training example.
  - For example, x^{<1>} = Harry, x^{<2>} = Potter.
- y^{(i)<t>}: The t-th element in the output sequence of the i-th training example.
  - For example, y^{<1>} = 1, y^{<2>} = 1.
- T_x^{(i)}: Length of the input sequence for the i-th training example.
  - Varies across different examples.
- T_y^{(i)}: Length of the output sequence for the i-th training example.
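The notation above can be sketched concretely for the example sentence (variable names here are illustrative, not part of the notes):

```python
# The input sequence x: tokenize the example sentence into 9 words.
x = "Harry Potter and Hermione Granger invented a new spell.".rstrip(".").split()

# The output sequence y: 1 marks a word that is part of a name, 0 otherwise.
y = [1, 1, 0, 1, 1, 0, 0, 0, 0]

T_x = len(x)  # length of the input sequence
T_y = len(y)  # length of the output sequence

# Note the notation is 1-indexed (x^<1> = "Harry"), while Python is 0-indexed.
print(x[0], T_x, T_y)  # Harry 9 9
```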
Representing Words:
In NLP (Natural Language Processing), a key challenge is how to represent words. There are two main approaches:
- Vocabulary List:
- Contains all target set words.
- Example: [a, ..., And, ..., Harry, ..., Potter, ..., Zulu]
- Each word has a unique index.
- Sorted alphabetically.
- Vocabulary sizes range from 30,000 to 50,000, with larger companies using up to a million.
- Build a vocabulary list by analyzing texts for the most frequent words.
- One-Hot Encoding:
- Create a one-hot encoded vector for each word based on the vocabulary.
  - Handle unknown words with a special token, such as <UNK>, in the vocabulary.
Example: for the sentence above, a word such as "Harry" is represented as a vector of all zeros with a single 1 at Harry's index in the vocabulary.

The objective is to learn a mapping from this representation of x to the target output y as part of a supervised learning problem.
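The two steps above — building a frequency-based vocabulary and one-hot encoding against it — can be sketched as follows (the tiny corpus and `max_size` value are assumptions for illustration; real vocabularies hold tens of thousands of words):

```python
from collections import Counter

def build_vocab(corpus_tokens, max_size):
    """Keep the most frequent words and reserve one slot for <UNK>."""
    counts = Counter(corpus_tokens)
    most_common = [w for w, _ in counts.most_common(max_size - 1)]
    return most_common + ["<UNK>"]

def one_hot(word, word_to_index):
    """All zeros except a 1 at the word's index; unknown words map to <UNK>."""
    vec = [0] * len(word_to_index)
    vec[word_to_index.get(word, word_to_index["<UNK>"])] = 1
    return vec

tokens = "harry potter and hermione granger invented a new spell".split()
vocab = build_vocab(tokens, max_size=6)
word_to_index = {w: i for i, w in enumerate(vocab)}

print(one_hot("harry", word_to_index))   # a 1 at "harry"'s index
print(one_hot("wizard", word_to_index))  # unknown word, falls back to <UNK>
```

In practice the vocabulary is built once over the whole training corpus, and every input word is replaced by its one-hot vector (or its index) before being fed to the model.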