Bidirectional LSTM in NLP

In this article, we will first discuss bidirectional LSTMs and their architecture. We will then look into the implementation of a review system using Bidirectional LSTM. Finally, we will conclude this article while discussing the applications of bidirectional LSTM.

Bidirectional LSTM (BiLSTM)

Bidirectional LSTM or BiLSTM is a term used for a sequence model which contains two LSTM layers, one for processing input in the forward direction and the other for processing in the backward direction. It is usually used in NLP-related tasks. The intuition behind this approach is that by processing data in both directions, the model is able to better understand the relationship between sequences (e.g. knowing the following and preceding words in a sentence).

To better understand this, let us look at an example. The first statement is “Server, can you bring me this dish?” and the second statement is “He crashed the server”. In both statements, the word “server” has a different meaning, and that meaning depends on the words that precede and follow it. A bidirectional LSTM helps the model capture this relationship better than a unidirectional LSTM. This ability makes BiLSTM a suitable architecture for tasks like sentiment analysis, text classification, and machine translation.

Architecture

The architecture of a bidirectional LSTM comprises two unidirectional LSTMs which process the sequence in the forward and backward directions. It can be interpreted as two separate LSTM networks: one receives the sequence of tokens as it is, while the other receives it in reverse order. Each of these LSTM networks returns a probability vector as output, and the final output is the combination of both of these probabilities. It can be represented as:

$$p_t = p_t^f + p_t^b$$

where,

$p_t$: Final probability vector of the network.
$p_t^f$: Probability vector from the forward LSTM network.
$p_t^b$: Probability vector from the backward LSTM network.
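To make the combination rule concrete, here is a minimal sketch in plain Python. It is not an LSTM: a toy scalar tanh recurrent cell (the hypothetical helper `run_rnn`, with made-up weights) stands in for each of the two networks, and the forward and backward outputs are summed position by position as in the formula above.

```python
import math


def run_rnn(inputs, w_in, w_rec):
    """Run a toy scalar recurrent cell and return its state at every step."""
    state = 0.0
    states = []
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states


def bidirectional_sum(inputs, fwd_weights, bwd_weights):
    """Combine a forward and a backward pass by element-wise addition."""
    forward = run_rnn(inputs, *fwd_weights)
    # The backward network reads the reversed sequence; its outputs are
    # reversed again so that index t lines up with the forward outputs.
    backward = run_rnn(inputs[::-1], *bwd_weights)[::-1]
    combined = [f + b for f, b in zip(forward, backward)]
    return forward, backward, combined


# Illustrative inputs and weights (assumptions, not learned values)
sequence = [0.5, -1.0, 2.0, 0.1]
forward, backward, combined = bidirectional_sum(
    sequence, fwd_weights=(0.8, 0.3), bwd_weights=(0.6, -0.2))
print(combined)
```

Each combined value at position t depends on everything before t (through the forward pass) and everything after t (through the backward pass). Note that Keras's `Bidirectional` wrapper concatenates the two outputs by default (`merge_mode='concat'`); passing `merge_mode='sum'` gives the additive form shown here.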

Bidirectional LSTM layer Architecture

Figure 1 describes the architecture of the BiLSTM layer: each input token is fed to two LSTM nodes, one in the forward network and one in the backward network, and the output token $Y_i$ is the combination of the outputs of these two LSTM nodes.

Now, let us look at an implementation of a review system using BiLSTM layers in Python with the TensorFlow library. We will perform sentiment analysis on the IMDB movie review dataset, building the network from scratch and training it to identify whether a review is positive or negative.

Importing Libraries and Dataset

Python libraries make it easy to handle the data and to perform both typical and complex tasks with a single line of code.

NumPy – NumPy arrays are very fast and can perform large computations in a very short time.
Matplotlib – This library is used to draw visualizations.
TensorFlow – This is an open-source library for machine learning and artificial intelligence that provides a range of functions to achieve complex functionality with single lines of code.

Python3

import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import matplotlib.pyplot as plt

The IMDB movie review dataset is a binary sentiment classification dataset containing 25,000 highly polar movie reviews for training and 25,000 for testing. It can be downloaded from the dataset's website, or loaded with the tensorflow_datasets library as we do here.

Python3

# Obtain the IMDB review dataset from TensorFlow Datasets
dataset = tfds.load('imdb_reviews', as_supervised=True)

# Separate the train and test datasets
train_dataset, test_dataset = dataset['train'], dataset['test']

# Shuffle the training set, then split both sets into batches of 32
batch_size = 32
train_dataset = train_dataset.shuffle(10000)
train_dataset = train_dataset.batch(batch_size)
test_dataset = test_dataset.batch(batch_size)
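As a rough illustration of what shuffling and batching do to the 25,000 training reviews, the following plain-Python sketch splits a shuffled list into batches of 32. The helper `shuffle_and_batch` is hypothetical, and it performs a full shuffle for simplicity, whereas `shuffle(10000)` in tf.data shuffles through a 10,000-element buffer.

```python
import random


def shuffle_and_batch(items, batch_size, seed=0):
    """Shuffle a list and split it into consecutive batches."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]


# Integer ids stand in for the 25,000 training reviews
batches = shuffle_and_batch(range(25000), batch_size=32)
num_batches = len(batches)
last_batch_size = len(batches[-1])
print(num_batches, last_batch_size)  # prints: 782 8
```

Because 25,000 is not a multiple of 32, the last batch is smaller than the others. The 782 batches correspond to the 782 steps per epoch reported in the training log later in this article.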

Printing a sample review and its label from the training set.

Python3

# Take the first batch and print its first review and label
example, label = next(iter(train_dataset))
print('Text:\n', example.numpy()[0])
print('\nLabel: ', label.numpy()[0])

Output:

Text:
 b'Stumbling upon this HBO special late one night, I was absolutely taken by this 
 attractive British "executive transvestite." I have never laughed so hard over 
 European History or any of the other completely worthwhile point Eddie Izzard made.
  I laughed so much that I woke up my mother sleeping at the other end of the house...'
Label:  1

Model Architecture

In this section, we will define the model we will use for sentiment analysis. The initial layer of this architecture is the text vectorization layer, responsible for encoding the input text into a sequence of token indices. These tokens are subsequently fed into the embedding layer, where each word is assigned a trainable vector. After enough training, these vectors tend to adjust themselves such that words with similar meanings have similar vectors. The embedded sequences are then passed to the Bidirectional LSTM layers, and the fully connected layers that follow reduce the result to a single logit as the classification output.

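We will first perform text vectorization and let the encoder map the words in the training dataset to integer tokens. With max_tokens=10000, only the most frequent words get a token of their own; every other word is mapped to [UNK]. The example below also shows how the sample review is encoded into a vector of integers and decoded back.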

Python3

# Use the TextVectorization layer to normalize, split, and map strings
# to integers.
encoder = tf.keras.layers.TextVectorization(max_tokens=10000)
encoder.adapt(train_dataset.map(lambda text, _: text))

# Extract the vocabulary from the TextVectorization layer.
vocabulary = np.array(encoder.get_vocabulary())

# Encode the sample review and decode it back.
original_text = example.numpy()[0]
encoded_text = encoder(original_text).numpy()
decoded_text = ' '.join(vocabulary[encoded_text])

print('original: ', original_text)
print('encoded: ', encoded_text)
print('decoded: ', decoded_text)

Output:

original: 
 b'Stumbling upon this HBO special late one night, I was absolutely taken by this 
 attractive British "executive transvestite." I have never laughed so hard over 
 European History or any of the other completely worthwhile point Eddie Izzard made. 
 I laughed so much that I woke up my mother sleeping at the other end of the house...'
encoded: 
 [9085  720   11 4335  309  534   29  311   10   14  412  602   33   11
 1523  683 3505    1   10   26  110 1434   38  264  126 1835  489   42
   99    5    2   81  325 2601  215 1781 9352   91   10 1434   38   73
   12   10 9259   58   56  462 2703   31    2   81  129    5    2  313]
decoded: 
 stumbling upon this hbo special late one night i was absolutely taken by this 
 attractive british executive [UNK] i have never laughed so hard over european history
  or any of the other completely worthwhile point eddie izzard made i laughed so much 
  that i woke up my mother sleeping at the other end of the house
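The behaviour seen in this output (lowercasing, stripped punctuation, and [UNK] for rare words) can be approximated in plain Python. The sketch below is only an approximation of what TextVectorization does with its default settings, not its actual implementation; the helpers `standardize`, `build_vocabulary`, `encode`, and `decode` and the two toy reviews are hypothetical.

```python
import string
from collections import Counter

PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)


def standardize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    return text.lower().translate(PUNCTUATION_TABLE).split()


def build_vocabulary(texts, max_tokens):
    """Index 0 is reserved for padding, index 1 for unknown words."""
    counts = Counter(tok for text in texts for tok in standardize(text))
    frequent = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ['', '[UNK]'] + frequent


def encode(text, vocabulary):
    index = {word: i for i, word in enumerate(vocabulary)}
    return [index.get(tok, 1) for tok in standardize(text)]


def decode(ids, vocabulary):
    return ' '.join(vocabulary[i] for i in ids)


# Toy corpus standing in for the training reviews
reviews = ['The movie was great, great fun!', 'The plot was thin.']
vocab = build_vocabulary(reviews, max_tokens=6)

encoded = encode('The film was great', vocab)
decoded = decode(encoded, vocab)
print(encoded)
print(decoded)  # prints: the [UNK] was great
```

The word "film" never appears in the toy corpus, so it is mapped to index 1 and decodes to [UNK], just as "transvestite" does in the output above because it falls outside the 10,000 most frequent tokens.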

Now, we will use this trained encoder along with Bidirectional LSTM layers to define a model as discussed earlier.

We will implement a Sequential model which will contain the following parts:

The first layer after the encoder is the embedding layer, which creates an embedding for the input text.
Then come the bidirectional LSTM layers, which learn longer-range dependencies in the sequence.
Finally, there are two fully connected layers whose final output is a single logit: a positive value indicates a positive review.

Python3

# Creating the model
model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        len(encoder.get_vocabulary()), 64, mask_zero=True),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

# Summary of the model
model.summary()

# Compile the model
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(),
    metrics=['accuracy']
)

Output:

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 text_vectorization (TextVec  (None, None)             0         
 torization)                                                     

embedding (Embedding)       (None, None, 64)          640000    

bidirectional (Bidirectiona  (None, None, 128)        66048     
 l)                                                              

bidirectional_1 (Bidirectio  (None, 64)               41216     
 nal)                                                            

dense (Dense)               (None, 64)                4160      

dense_1 (Dense)             (None, 1)                 65        

=================================================================
Total params: 751,489
Trainable params: 751,489
Non-trainable params: 0
_________________________________________________________________
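The parameter counts in this summary can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gate blocks, each with an input kernel, a recurrent kernel, and a bias, and a bidirectional layer that holds two independent copies of the wrapped LSTM. The helper function names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One trainable vector of size `dim` per vocabulary entry
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gate blocks, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)


def bidirectional_lstm_params(input_dim, units):
    # A forward and a backward LSTM with separate weights
    return 2 * lstm_params(input_dim, units)


def dense_params(input_dim, units):
    return input_dim * units + units


counts = {
    'embedding': embedding_params(10000, 64),
    # Input is the 64-dimensional embedding
    'bidirectional': bidirectional_lstm_params(64, 64),
    # Input is 128-dimensional: forward and backward outputs concatenated
    'bidirectional_1': bidirectional_lstm_params(128, 32),
    'dense': dense_params(64, 64),
    'dense_1': dense_params(64, 1),
}
total = sum(counts.values())
print(counts, total)
```

The output shapes tell the same story: the first bidirectional layer wraps an LSTM with 64 units but outputs 128 features, because the forward and backward outputs are concatenated.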

Model Training

Now, we will train the model we defined in the previous step for five epochs.

Python3

# Training the model and validating it on the test set
history = model.fit(
    train_dataset,
    epochs=5,
    validation_data=test_dataset,
)

Output:

Epoch 1/5
782/782 [==============================] - 1209s 2s/step - loss: 0.3657 - 
accuracy: 0.8266 - val_loss: 0.3110 - val_accuracy: 0.8441
Epoch 2/5
782/782 [==============================] - 1269s 2s/step - loss: 0.2147 - 
accuracy: 0.9126 - val_loss: 0.3566 - val_accuracy: 0.8590
Epoch 3/5
782/782 [==============================] - 1146s 1s/step - loss: 0.1616 - 
accuracy: 0.9380 - val_loss: 0.3764 - val_accuracy: 0.8670
Epoch 4/5
782/782 [==============================] - 1963s 3s/step - loss: 0.0962 - 
accuracy: 0.9647 - val_loss: 0.4271 - val_accuracy: 0.8564
Epoch 5/5
782/782 [==============================] - 1121s 1s/step - loss: 0.0573 - 
accuracy: 0.9796 - val_loss: 0.5516 - val_accuracy: 0.8575
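One way to read this log is to compare the training and validation curves epoch by epoch. The sketch below copies the values from the log above into plain lists and finds the epochs with the lowest validation loss and the highest validation accuracy. The interpretation offered afterwards is a common reading of such curves, not a claim made by the training run itself.

```python
# Values copied from the training log above (epochs 1 to 5)
train_loss = [0.3657, 0.2147, 0.1616, 0.0962, 0.0573]
val_loss = [0.3110, 0.3566, 0.3764, 0.4271, 0.5516]
val_accuracy = [0.8441, 0.8590, 0.8670, 0.8564, 0.8575]

# Epoch numbers are 1-based, list indices are 0-based
best_loss_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_acc_epoch = max(range(len(val_accuracy)), key=val_accuracy.__getitem__) + 1

# Gap between validation and training loss at each epoch
gaps = [round(v - t, 4) for v, t in zip(val_loss, train_loss)]

print(best_loss_epoch, best_acc_epoch)  # prints: 1 3
print(gaps)
```

The training loss keeps falling while the validation loss rises after the first epoch and the gap between them widens every epoch, which is usually a sign of overfitting. Under that reading, training for fewer epochs would likely give a similar validation accuracy.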

Plotting the training and validation accuracy and loss plots.

Python3

# Plotting the accuracy and loss over time

# Training history
history_dict = history.history

# Separating validation and training accuracy
acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']

# Separating validation and training loss
loss = history_dict['loss']
val_loss = history_dict['val_loss']

# Plotting
plt.figure(figsize=(8, 4))

plt.subplot(1, 2, 1)
plt.plot(acc)
plt.plot(val_acc)
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(['Accuracy', 'Validation Accuracy'])

plt.subplot(1, 2, 2)
plt.plot(loss)
plt.plot(val_loss)
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend(['Loss', 'Validation Loss'])

plt.show()

Output:

The plot of training and validation accuracy and loss

Model Evaluation

Now, we will test the trained model with a random review and check its output.

Python3

# Making predictions
sample_text = (
    '''The movie by GeeksforGeeks was so good and the animation are so dope.
    I would recommend my friends to watch it.'''
)
predictions = model.predict(np.array([sample_text]))
print(*predictions[0])

# Print the label based on the prediction
# (the model outputs a logit, so 0 is the decision boundary)
if predictions[0] > 0:
    print('The review is positive')
else:
    print('The review is negative')

Output:

1/1 [==============================] - 0s 33ms/step
5.414222
The review is positive
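The printed value is a logit, not a probability, because the last Dense layer has no activation and the loss was configured with from_logits=True. Applying the sigmoid function converts it into a probability, as in this minimal sketch (the `sigmoid` helper is defined here for illustration):

```python
import math


def sigmoid(logit):
    """Map a real-valued logit to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))


logit = 5.414222  # value printed by the model above
probability = sigmoid(logit)
label = 'positive' if probability > 0.5 else 'negative'
print(f'{probability:.4f} {label}')
```

A logit of 0 corresponds to a probability of exactly 0.5, which is why comparing the prediction with 0 in the code above is equivalent to comparing the probability with 0.5.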

Applications of Bidirectional LSTM

Some of the popular applications that use BiLSTM are sentiment analysis, text classification, text generation, and machine translation.
