An Introduction to Recurrent Neural Networks for Beginners

Recurrent Neural Networks (RNNs) are a kind of neural network that specializes in processing sequences. They’re often used in Natural Language Processing (NLP) tasks because of their effectiveness in handling text. In this post, we’ll explore what RNNs are, understand how they work, and build a real one from scratch (using only numpy) in Python.

This post assumes a basic knowledge of neural networks. My introduction to Neural Networks covers everything you’ll need to know, so I’d recommend reading that first.

Let’s get into it!

1. The Why

One issue with vanilla neural nets (and also CNNs) is that they only work with pre-determined sizes: they take fixed-size inputs and produce fixed-size outputs. RNNs are useful because they let us have variable-length sequences as both inputs and outputs. Here are a few examples of what RNNs can look like:

rnns.jpg

Inputs are red, the RNN itself is green, and outputs are blue. Source: Andrej Karpathy

This ability to process sequences makes RNNs very useful. For example:

  • Machine Translation (e.g. Google Translate) is done with “many to many” RNNs. The original text sequence is fed into an RNN, which then produces translated text as output.
  • Sentiment Analysis (e.g. Is this a positive or negative review?) is often done with “many to one” RNNs. The text to be analyzed is fed into an RNN, which then produces a single output classification (e.g. This is a positive review).

Later in this post, we’ll build a “many to one” RNN from scratch to perform basic Sentiment Analysis.

2. The How

Let’s consider a “many to many” RNN with inputs $x_0, x_1, \ldots, x_n$ that wants to produce outputs $y_0, y_1, \ldots, y_n$. These $x_i$ and $y_i$ are vectors and can have arbitrary dimensions.

RNNs work by iteratively updating a hidden state $h$, which is a vector that can also have arbitrary dimension. At any given step $t$,

  1. The next hidden state $h_t$ is calculated using the previous hidden state $h_{t-1}$ and the next input $x_t$.
  2. The next output $y_t$ is calculated using $h_t$.

many-to-many.svg A many to many RNN

Here’s what makes an RNN recurrent: it uses the same weights for each step. More specifically, a typical vanilla RNN uses only 3 sets of weights to perform its calculations:

  • $W_{xh}$, used for all $x_t \to h_t$ links.
  • $W_{hh}$, used for all $h_{t-1} \to h_t$ links.
  • $W_{hy}$, used for all $h_t \to y_t$ links.

We’ll also use two biases for our RNN:

  • $b_h$, added when calculating $h_t$.
  • $b_y$, added when calculating $y_t$.

We’ll represent the weights as matrices and the biases as vectors. These 3 weights and 2 biases make up the entire RNN!

Here are the equations that put everything together:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

$$y_t = W_{hy} h_t + b_y$$

Don't skim over these equations. Stop and stare at this for a minute. Also, remember that the weights are matrices and the other variables are vectors.

All the weights are applied using matrix multiplication, and the biases are added to the resulting products. We then use tanh as an activation function for the first equation (but other activations like sigmoid can also be used).
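
To make these equations concrete, here's a minimal numpy sketch of a single RNN step with made-up dimensions (a 3-dimensional input, a 4-dimensional hidden state, and a 2-dimensional output); the weight shapes follow directly from the matrix multiplications above:

import numpy as np

# Hypothetical sizes, for illustration only.
input_size, hidden_size, output_size = 3, 4, 2

# Weights (matrices) and biases (vectors).
Wxh = np.random.randn(hidden_size, input_size)
Whh = np.random.randn(hidden_size, hidden_size)
Why = np.random.randn(output_size, hidden_size)
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

h_prev = np.zeros((hidden_size, 1))    # previous hidden state
x_t = np.random.randn(input_size, 1)   # current input vector

# One step of the RNN, straight from the two equations above.
h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + bh)
y_t = Why @ h_t + by
print(h_t.shape, y_t.shape)  # (4, 1) (2, 1)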

No idea what an activation function is? Read my introduction to Neural Networks like I mentioned. Seriously.

3. The Problem

Let’s get our hands dirty! We’ll implement an RNN from scratch to perform a simple Sentiment Analysis task: determining whether a given text string is positive or negative.

Here are a few samples from the small dataset I put together for this post:

Text                                     Positive?
i am good                                ✓
i am bad                                 ✗
this is very good                        ✓
this is not bad                          ✓
i am bad not good                        ✗
i am not at all happy                    ✗
this was good earlier                    ✓
i am not at all bad or sad right now     ✓

4. The Plan

Since this is a classification problem, we’ll use a “many to one” RNN. This is similar to the “many to many” RNN we discussed earlier, but it only uses the final hidden state to produce the one output $y$:

many-to-one.svg A many to one RNN

Each $x_i$ will be a vector representing a word from the text. The output $y$ will be a vector containing two numbers, one representing positive and the other negative. We’ll apply Softmax to turn those values into probabilities and ultimately decide between positive / negative.
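
To illustrate that last step with made-up numbers (the real softmax() helper is written later in main.py), turning the 2-number output into a prediction could look like this, assuming index 1 stands for "positive" as in the training labels we'll use later:

import numpy as np

y = np.array([[0.3], [1.2]])            # hypothetical raw RNN output
probs = np.exp(y) / np.sum(np.exp(y))   # Softmax turns the outputs into probabilities
print(probs.ravel())                    # approximately [0.289 0.711]
# Index 1 is assumed to mean "positive" here:
print('Positive' if np.argmax(probs) == 1 else 'Negative')  # Positive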

Let’s start building our RNN!

5. The Pre-Processing

The dataset I mentioned earlier consists of two Python dictionaries:

data.py

train_data = {
  'good': True,
  'bad': False,
  # ... more data
}

test_data = {
  'this is happy': True,
  'i am good': True,
  # ... more data
}
True = Positive, False = Negative

We’ll have to do some pre-processing to get the data into a usable format. To start, we’ll construct a vocabulary of all words that exist in our data:

main.py

from data import train_data, test_data

# Create the vocabulary.
vocab = list(set([w for text in train_data.keys() for w in text.split(' ')]))
vocab_size = len(vocab)
print('%d unique words found' % vocab_size) # 18 unique words found

vocab now holds a list of all words that appear in at least one training text. Next, we’ll assign an integer index to represent each word in our vocab.

main.py

# Assign indices to each word.
word_to_idx = { w: i for i, w in enumerate(vocab) }
idx_to_word = { i: w for i, w in enumerate(vocab) }
print(word_to_idx['good']) # 16 (this may change)
print(idx_to_word[0]) # sad (this may change)

We can now represent any given word with its corresponding integer index! This is necessary because RNNs can’t understand words - we have to give them numbers.

Finally, recall that each input $x_i$ to our RNN is a vector. We’ll use one-hot vectors, which contain all zeros except for a single one. The “one” in each one-hot vector will be at the word’s corresponding integer index.

Since we have 18 unique words in our vocabulary, each $x_i$ will be an 18-dimensional one-hot vector.

main.py

import numpy as np

def createInputs(text):
  '''
  Returns an array of one-hot vectors representing the words
  in the input text string.
  - text is a string
  - Each one-hot vector has shape (vocab_size, 1)
  '''
  inputs = []
  for w in text.split(' '):
    v = np.zeros((vocab_size, 1))
    v[word_to_idx[w]] = 1
    inputs.append(v)
  return inputs

We’ll use createInputs() later to create vector inputs to pass into our RNN.
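
For example, assuming the 18-word vocabulary built above, calling it on a short sentence gives one one-hot vector per word:

inputs = createInputs('i am good')
print(len(inputs))                      # 3 -- one vector per word
print(inputs[0].shape)                  # (18, 1)
print(inputs[2][word_to_idx['good']])   # [1.] -- the "hot" entry for 'good'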

6. The Forward Phase

It’s time to start implementing our RNN! We’ll start by initializing the 3 weights and 2 biases our RNN needs:

rnn.py

import numpy as np
from numpy.random import randn

class RNN:
  # A Vanilla Recurrent Neural Network.

  def __init__(self, input_size, output_size, hidden_size=64):
    # Weights
    self.Whh = randn(hidden_size, hidden_size) / 1000
    self.Wxh = randn(hidden_size, input_size) / 1000
    self.Why = randn(output_size, hidden_size) / 1000

    # Biases
    self.bh = np.zeros((hidden_size, 1))
    self.by = np.zeros((output_size, 1))
Note: We're dividing by 1000 to reduce the initial variance of our weights. This is not the best way to initialize weights, but it's simple and works for this post.

We use np.random.randn() to initialize our weights from the standard normal distribution.
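
If you want to convince yourself of the shapes involved, a quick check like this (not part of the final files; it assumes the RNN class above is available) shows how each matrix maps one vector space to another:

rnn = RNN(input_size=18, output_size=2)
print(rnn.Wxh.shape)               # (64, 18) -- input vector -> hidden space
print(rnn.Whh.shape)               # (64, 64) -- previous hidden state -> hidden space
print(rnn.Why.shape)               # (2, 64)  -- final hidden state -> output
print(rnn.bh.shape, rnn.by.shape)  # (64, 1) (2, 1)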

Next, let’s implement our RNN’s forward pass. Remember these two equations we saw earlier?

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

$$y_t = W_{hy} h_t + b_y$$

Here are those same equations put into code:

rnn.py

class RNN:
  # ...

  def forward(self, inputs):
    '''
    Perform a forward pass of the RNN using the given inputs.
    Returns the final output and hidden state.
    - inputs is an array of one hot vectors with shape (input_size, 1).
    '''
    h = np.zeros((self.Whh.shape[0], 1))

    # Perform each step of the RNN
    for i, x in enumerate(inputs):
      h = np.tanh(self.Wxh @ x + self.Whh @ h + self.bh)

    # Compute the output
    y = self.Why @ h + self.by

    return y, h

Pretty simple, right? Note that we initialized $h$ to the zero vector for the first step, since there’s no previous $h$ we can use at that point.

Let’s try it out:

main.py

# ...

def softmax(xs):
  # Applies the Softmax Function to the input array.
  return np.exp(xs) / sum(np.exp(xs))

# Initialize our RNN!
rnn = RNN(vocab_size, 2)

inputs = createInputs('i am very good')
out, h = rnn.forward(inputs)
probs = softmax(out)
print(probs) # [[0.50000095], [0.49999905]]

Need a refresher on Softmax? Read my quick explanation of Softmax.

Our RNN works, but it’s not very useful yet. Let’s change that…

7. The Backward Phase

In order to train our RNN, we first need a loss function. We’ll use cross-entropy loss, which is often paired with Softmax. Here’s how we calculate it:

$$L = -\ln(p_c)$$

where $p_c$ is our RNN’s predicted probability for the correct class (positive or negative). For example, if a positive text is predicted to be 90% positive by our RNN, the loss is:

$$L = -\ln(0.90) = 0.105$$
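
In code, this is a one-liner. Here's a sketch, where probs is the Softmax output from the forward pass and target is the index of the correct class (1 for positive, 0 for negative, matching the labels we'll use in the training loop below):

# Cross-entropy loss: negative log-probability of the correct class.
loss = float(-np.log(probs[target]))  # e.g. -ln(0.9) ≈ 0.105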

Want a longer explanation? Read the Cross-Entropy Loss section of my introduction to Convolutional Neural Networks (CNNs).

Now that we have a loss, we’ll train our RNN using gradient descent to minimize loss. That means it’s time to derive some gradients!

:warning: The following section assumes a basic knowledge of multivariable calculus . You can skip it if you want, but I recommend giving it a skim even if you don’t understand much. We’ll incrementally write code as we derive results , and even a surface-level understanding can be helpful.

If you want some extra background for this section, I recommend first reading the Training a Neural Network section of my introduction to Neural Networks. Also, all of the code for this post is on Github , so you can follow along there if you’d like.

Ready? Here we go.

7.1 Definitions

First, some definitions:

  • Let $y$ represent the raw outputs from our RNN.
  • Let $p$ represent the final probabilities: $p = \text{softmax}(y)$.
  • Let $c$ refer to the true label of a certain text sample, a.k.a. the “correct” class.
  • Let $L$ be the cross-entropy loss: $L = -\ln(p_c)$.
  • Let $W_{xh}$, $W_{hh}$, and $W_{hy}$ be the 3 weight matrices in our RNN.
  • Let $b_h$ and $b_y$ be the 2 bias vectors in our RNN.

7.2 Setup

Next, we need to edit our forward phase to cache some data for use in the backward phase. While we’re at it, we’ll also set up the skeleton for our backward phase. Here’s what that looks like:

rnn.py

class RNN:
  # ...

  def forward(self, inputs):
    '''
    Perform a forward pass of the RNN using the given inputs.
    Returns the final output and hidden state.
    - inputs is an array of one hot vectors with shape (input_size, 1).
    '''
    h = np.zeros((self.Whh.shape[0], 1))

    self.last_inputs = inputs
    self.last_hs = { 0: h }

    # Perform each step of the RNN
    for i, x in enumerate(inputs):
      h = np.tanh(self.Wxh @ x + self.Whh @ h + self.bh)
      self.last_hs[i + 1] = h

    # Compute the output
    y = self.Why @ h + self.by

    return y, h

  def backprop(self, d_y, learn_rate=2e-2):
    '''
    Perform a backward pass of the RNN.
    - d_y (dL/dy) has shape (output_size, 1).
    - learn_rate is a float.
    '''
    pass

Curious about why we’re doing this caching? Read my explanation in the Training Overview of my introduction to CNNs, in which we do the same thing.

7.3 Gradients

It’s math time! We’ll start by calculating $\frac{\partial L}{\partial y}$. We know:

$$L = -\ln(p_c) = -\ln(\text{softmax}(y)_c)$$

I’ll leave the actual derivation of $\frac{\partial L}{\partial y}$ using the Chain Rule as an exercise for you :wink:, but the result comes out really nice:

$$\frac{\partial L}{\partial y_i} = \begin{cases} p_i & \text{if } i \neq c \\ p_i - 1 & \text{if } i = c \end{cases}$$

For example, if we have $p = [0.2, 0.2, 0.6]$ and the correct class is $c = 0$, then we’d get $\frac{\partial L}{\partial y} = [-0.8, 0.2, 0.6]$. This is also quite easy to turn into code:

main.py

# Loop over each training example
for x, y in train_data.items():
  inputs = createInputs(x)
  target = int(y)

  # Forward
  out, _ = rnn.forward(inputs)
  probs = softmax(out)

  # Build dL/dy
  d_L_d_y = probs
  d_L_d_y[target] -= 1

  # Backward
  rnn.backprop(d_L_d_y)

Nice. Next up, let’s take a crack at gradients for $W_{hy}$ and $b_y$, which are only used to turn the final hidden state into the RNN’s output. We have:

$$\frac{\partial L}{\partial W_{hy}} = \frac{\partial L}{\partial y} * \frac{\partial y}{\partial W_{hy}}$$

$$y = W_{hy} h_n + b_y$$

where $h_n$ is the final hidden state. Thus,

$$\frac{\partial y}{\partial W_{hy}} = h_n$$

$$\frac{\partial L}{\partial W_{hy}} = \boxed{\frac{\partial L}{\partial y} h_n}$$

Similarly,

$$\frac{\partial y}{\partial b_y} = 1$$

$$\frac{\partial L}{\partial b_y} = \boxed{\frac{\partial L}{\partial y}}$$

We can now start implementing backprop()!

rnn.py

class RNN:
  # ...

  def backprop(self, d_y, learn_rate=2e-2):
    '''
    Perform a backward pass of the RNN.
    - d_y (dL/dy) has shape (output_size, 1).
    - learn_rate is a float.
    '''
    n = len(self.last_inputs)

    # Calculate dL/dWhy and dL/dby.
    d_Why = d_y @ self.last_hs[n].T
    d_by = d_y

Reminder: We created self.last_hs in forward() earlier.

Finally, we need the gradients for $W_{hh}$, $W_{xh}$, and $b_h$, which are used every step during the RNN. We have:

$$\frac{\partial L}{\partial W_{xh}} = \frac{\partial L}{\partial y} \sum_t \frac{\partial y}{\partial h_t} * \frac{\partial h_t}{\partial W_{xh}}$$

because changing $W_{xh}$ affects every $h_t$, which all affect $y$ and ultimately $L$. In order to fully calculate the gradient of $W_{xh}$, we’ll need to backpropagate through all timesteps, which is known as Backpropagation Through Time (BPTT):

bptt.svg Backpropagation Through Time

$W_{xh}$ is used for all $x_t \to h_t$ forward links, so we have to backpropagate back to each of those links.

Once we arrive at a given step $t$, we need to calculate $\frac{\partial h_t}{\partial W_{xh}}$:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$

The derivative of $\tanh$ is well-known:

$$\frac{d \tanh(x)}{dx} = 1 - \tanh^2(x)$$

We use Chain Rule like usual:

$$\frac{\partial h_t}{\partial W_{xh}} = \boxed{(1 - h_t^2) \, x_t}$$

Similarly,

$$\frac{\partial h_t}{\partial W_{hh}} = \boxed{(1 - h_t^2) \, h_{t-1}}$$

$$\frac{\partial h_t}{\partial b_h} = \boxed{(1 - h_t^2)}$$
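
These local gradients are easy to sanity-check numerically. Here's a small, self-contained finite-difference check (made-up sizes and values, not part of the RNN code) for the $\frac{\partial h_t}{\partial b_h}$ result:

import numpy as np

np.random.seed(0)
hidden_size, input_size = 4, 3
Wxh = np.random.randn(hidden_size, input_size) / 10
Whh = np.random.randn(hidden_size, hidden_size) / 10
bh = np.zeros((hidden_size, 1))
x = np.random.randn(input_size, 1)
h_prev = np.random.randn(hidden_size, 1)

def step(bh):
  # One RNN step, viewed as a function of the bias only.
  return np.tanh(Wxh @ x + Whh @ h_prev + bh)

h = step(bh)
eps = 1e-6
j = 2  # perturb one component of the bias
bh_plus = bh.copy()
bh_plus[j] += eps

numeric = float((step(bh_plus)[j] - h[j]) / eps)
analytic = float(1 - h[j] ** 2)
print(numeric, analytic)  # the two values should agree closely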

The last thing we need is $\frac{\partial y}{\partial h_t}$. We can calculate this recursively:

$$\begin{aligned} \frac{\partial y}{\partial h_t} &= \frac{\partial y}{\partial h_{t+1}} * \frac{\partial h_{t+1}}{\partial h_t} \\ &= \frac{\partial y}{\partial h_{t+1}} (1 - h_{t+1}^2) W_{hh} \end{aligned}$$
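
Putting these pieces together, here's a sketch of how backprop() could be completed: walk backwards through time, accumulate each timestep's contribution to the gradients derived above, then apply a gradient descent update. The gradient clipping step is an extra, commonly used safeguard against exploding gradients (not something we derived), and the exact code in the repository may differ slightly.

rnn.py

class RNN:
  # ...

  def backprop(self, d_y, learn_rate=2e-2):
    '''
    Perform a backward pass of the RNN.
    - d_y (dL/dy) has shape (output_size, 1).
    - learn_rate is a float.
    '''
    n = len(self.last_inputs)

    # Calculate dL/dWhy and dL/dby.
    d_Why = d_y @ self.last_hs[n].T
    d_by = d_y

    # Initialize dL/dWhh, dL/dWxh, and dL/dbh to zero.
    d_Whh = np.zeros(self.Whh.shape)
    d_Wxh = np.zeros(self.Wxh.shape)
    d_bh = np.zeros(self.bh.shape)

    # dL/dh for the final hidden state.
    d_h = self.Why.T @ d_y

    # Backpropagate through time.
    for t in reversed(range(n)):
      # Intermediate value: dL/dh * (1 - h_t^2), i.e. backprop through tanh.
      temp = (1 - self.last_hs[t + 1] ** 2) * d_h

      d_bh += temp                           # dL/dbh += (1 - h_t^2) * dL/dh
      d_Whh += temp @ self.last_hs[t].T      # dL/dWhh += ... * h_{t-1}
      d_Wxh += temp @ self.last_inputs[t].T  # dL/dWxh += ... * x_t

      # Propagate dL/dh back to the previous hidden state.
      d_h = self.Whh.T @ temp

    # Clip gradients to keep them from exploding (extra safeguard).
    for d in [d_Wxh, d_Whh, d_Why, d_bh, d_by]:
      np.clip(d, -1, 1, out=d)

    # Update weights and biases using gradient descent.
    self.Whh -= learn_rate * d_Whh
    self.Wxh -= learn_rate * d_Wxh
    self.Why -= learn_rate * d_Why
    self.bh -= learn_rate * d_bh
    self.by -= learn_rate * d_by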

