RNNs and the Evolution of Attention

saravana alagar · Analytics Vidhya · Feb 2, 2021

Predicting the future has long been one of the most fascinating pursuits, and in the information age the development of modern computing and advances in Artificial Intelligence have made it more interesting than ever. Sequence-to-sequence modeling is one of the prime research areas, where a model learns from observations captured over time and generates such patterns for a given scenario. Although statistical methods have been around for a long time, recent advances in neural networks, especially RNNs, are stealing the show in applications such as Natural Language Processing, speech-to-text, and time-series analysis. We are going to look at an overview of RNNs, their popular variants LSTM and GRU, and the evolution of the attention mechanism that sits at the heart of today’s most advanced networks such as GPT-3.

Recurrent Neural Networks(RNN):

A typical fully-connected neural network has three kinds of layers, namely input, hidden, and output layers. These networks are excellent for regression and classification problems where the inputs do not depend on each other, i.e. there is no correlation between one input/observation and the next.

Fig: 1a — A Fully-Connected Network

But in a sequence prediction problem the observations are correlated with each other over time, so a fully-connected network by design cannot model such data: it has no way of remembering what happened in the previous step, which impacts the current prediction. This is exactly why RNNs were conceived, to address this memory/context issue.

Fig: 1b — Recurrent Neural Network

As illustrated above, an RNN is formed by simply adding a feedback loop from the output back to the next input. This design gives the network an idea of what the previous observation was, which inherently provides RNNs with the ability to remember context. Let’s see how it back-propagates.
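
To make the recurrence concrete, here is a minimal sketch of a single vanilla RNN step in NumPy. The weight names (Wx, Wh, b) and the layer sizes are illustrative assumptions, not something taken from the article:

    import numpy as np

    def rnn_step(x_t, h_prev, Wx, Wh, b):
        # The new hidden state mixes the current input with the previous hidden state.
        return np.tanh(x_t @ Wx + h_prev @ Wh + b)

    # Illustrative sizes: 4-dimensional inputs, 8-dimensional hidden state.
    rng = np.random.default_rng(0)
    Wx, Wh, b = rng.normal(size=(4, 8)), rng.normal(size=(8, 8)), np.zeros(8)

    h = np.zeros(8)                      # initial hidden state
    for x_t in rng.normal(size=(5, 4)):  # a toy sequence of 5 time steps
        h = rnn_step(x_t, h, Wx, Wh, b)  # the feedback loop: h is fed back in

The same weights are reused at every time step; only the hidden state h carries information forward.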

Back Propagation Through Time:

Unlike in fully-connected layers, the input is fed through an RNN sequentially, so back-propagation happens for each time step; the weight updates, or gradient flow, back-propagate through time. For example, comparing a forward pass through an FC (fully-connected) network and an RNN on a 5-step sequence, the FC network receives a single gradient update while the RNN receives five, one per time step.

Because of this, an RNN has to back-propagate through every time step. If the sequence is long, the gradient may shrink exponentially as it flows back through time. This phenomenon is called the vanishing gradient problem: the RNN fails to learn (update its weights) on long sequences, i.e. it loses the context across distant time steps. To understand this, consider a movie review that says “It is highly unlikely that I would not recommend this movie”, which is a positive review. Because of an RNN’s inability to remember long-term dependencies, its short-term memory, it might effectively only see “not recommend this movie” and classify the review as negative.
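
A rough numerical sketch of why this happens: back-propagating through a tanh RNN multiplies the gradient by a Jacobian at every time step, and when those factors are effectively smaller than one the gradient shrinks exponentially. The recurrent matrix below is a random stand-in chosen purely for illustration:

    import numpy as np

    rng = np.random.default_rng(1)
    Wh = rng.normal(scale=0.25, size=(8, 8))   # illustrative recurrent weights

    # Forward pass: collect hidden states over 50 time steps (no inputs, for simplicity).
    h, states = rng.normal(size=8), []
    for _ in range(50):
        h = np.tanh(Wh @ h)
        states.append(h)

    # Backward pass: the gradient is multiplied by one Jacobian per step.
    grad = np.ones(8)
    for t, h_t in enumerate(reversed(states)):
        grad = Wh.T @ (grad * (1 - h_t ** 2))  # Jacobian of h_t = tanh(Wh @ h_prev)
        if t % 10 == 0:
            print(t, np.linalg.norm(grad))     # the norm typically shrinks toward 0

With these (deliberately small) weights the gradient norm decays by several orders of magnitude well before it reaches the early time steps, which is exactly the vanishing gradient effect described above.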

To address this short-term memory issue, new variants of RNNs were introduced: Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU).

Long Short-Term Memory (LSTM)

Imagine your first roller-coaster ride: at every twist and turn, peak and trough, you screamed as if it were your last moment. But as you take the ride more often, those screams turn into joy and you start to enjoy it. The reason for this transition is that your brain retains the experience over time, and somewhere in the back of your mind you know that the ride is safe and fun. That somewhere-in-the-back-of-the-mind stuff is the context! This is exactly what an LSTM tries to capture.

LSTMs have a cell state (a context vector) running along the entire sequence that retains long-term dependencies.

Fig: 3a LSTM Architecture

As illustrated above, the horizontal line that runs along the top of the cell is called the cell state, and it is what retains the memory. Now let’s see how the different components work together and why each of them is needed.

The main objective of an LSTM is to maintain and make use of the cell state C. The gates and activation functions exist to do exactly that.

Forget Gate: The forget gate controls what should be removed from the memory. This helps the network drop information that is no longer relevant, so unwanted noise does not spoil the prediction.

Input Gate: The input gate controls what from the new input should be added to the memory. It filters the candidate memory that gets added to the cell state C.

Output Gate: The output gate produces the new hidden state by filtering the cell state C (squashed through tanh) with a gate computed from the current input and the previous hidden state.

Sigmoid Function

Fig: 3b — Sigmoid Function

The sigmoid function returns values between 0 and 1 for any input. Because of this property it is used as a selection function: where the gate output is close to 1, values pass through, and where it is close to 0, they are blocked. That is why the sigmoid is used in the forget, input, and output gates of the LSTM cell.

tanh Function

Fig: 3c — tanh function

The tanh function produces output between -1 and +1. This keeps the values bounded, so they do not blow up as the number of time steps increases.
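
Putting the gates and activations together, a single LSTM step might look like the NumPy sketch below. The packed weight matrix W, the bias b, and the layer sizes are assumptions made for brevity, not part of the article:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        # One affine transform of [h_prev, x_t], split into the four LSTM parts.
        z = np.concatenate([h_prev, x_t]) @ W + b
        f, i, o, g = np.split(z, 4)
        f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates
        g = np.tanh(g)                                 # candidate memory
        c = f * c_prev + i * g                         # forget old, add new to cell state C
        h = o * np.tanh(c)                             # new hidden state filtered from C
        return h, c

    # Illustrative sizes: 4-dimensional input, 8-dimensional hidden/cell state.
    rng = np.random.default_rng(0)
    W, b = rng.normal(size=(12, 32)), np.zeros(32)
    h, c = np.zeros(8), np.zeros(8)
    for x_t in rng.normal(size=(5, 4)):
        h, c = lstm_step(x_t, h, c, W, b)

Note how the cell state c is only ever scaled by the forget gate and added to, which is what lets it carry context across many time steps.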

Gated Recurrent Unit(GRU)

Fig 4a GRU Architecture

GRUs are a slight simplification of LSTMs (a code sketch of a single GRU step follows the list below).

  • They combine the forget and input gates into a single “update gate”.
  • They merge the cell state and hidden state into one.
  • A “reset gate” decides how much of the old hidden state is used (together with the new input) when computing the candidate state.
  • The update gate then blends the old hidden state with that candidate to form the new hidden state, which serves both as the memory and as the prediction.
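
A minimal sketch of one GRU step in NumPy, with hypothetical weight names (Wz, Wr, Wh) and illustrative sizes:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
        hx = np.concatenate([h_prev, x_t])
        z = sigmoid(hx @ Wz + bz)                                      # update gate
        r = sigmoid(hx @ Wr + br)                                      # reset gate
        h_cand = np.tanh(np.concatenate([r * h_prev, x_t]) @ Wh + bh)  # candidate state
        return (1 - z) * h_prev + z * h_cand                           # blend old and new

    rng = np.random.default_rng(0)
    Wz, Wr, Wh = (rng.normal(size=(12, 8)) for _ in range(3))
    bz = br = bh = np.zeros(8)
    h = np.zeros(8)
    for x_t in rng.normal(size=(5, 4)):
        h = gru_step(x_t, h, Wz, Wr, Wh, bz, br, bh)

There is no separate cell state here: the single vector h plays both roles, which is the main saving over the LSTM.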

Encoder and Decoder Architecture in RNNs

Unlike a classification or regression problem, a sequence-to-sequence prediction problem is challenging because the input and output sequences can have different, and varying, lengths. To mitigate this problem the encoder-decoder architecture was introduced.

Fig 5a encoder-decoder network

In this architecture a context vector C is passed from the encoder to the decoder block. The difficulty is that this single vector C has to efficiently encode all of the inputs, while the decoder must be able to decode precise predictions from it with very little error. And in sequence prediction, one decoder step may not know what the other decoder steps have produced. To overcome these issues the attention mechanism was introduced.
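
As a rough sketch of the idea, the encoder below folds the whole input sequence into one context vector, and the decoder unrolls that vector into predictions by feeding each output back in as the next input. The output projection Wy is a hypothetical addition, and for brevity the decoder reuses the encoder weights (in practice they are separate):

    import numpy as np

    def rnn_step(x_t, h_prev, Wx, Wh, b):
        return np.tanh(x_t @ Wx + h_prev @ Wh + b)

    rng = np.random.default_rng(0)
    Wx, Wh, b = rng.normal(size=(4, 8)), rng.normal(size=(8, 8)), np.zeros(8)
    Wy = rng.normal(size=(8, 4))             # hypothetical output projection

    # Encoder: compress the whole input sequence into a single context vector C.
    h = np.zeros(8)
    for x_t in rng.normal(size=(6, 4)):      # input sequence, 6 steps
        h = rnn_step(x_t, h, Wx, Wh, b)
    context = h                              # everything the decoder gets to see

    # Decoder: start from C and feed each prediction back as the next input.
    s, y, outputs = context, np.zeros(4), []
    for _ in range(3):                       # output sequence, 3 steps
        s = rnn_step(y, s, Wx, Wh, b)
        y = s @ Wy
        outputs.append(y)

The bottleneck is visible in the code: no matter how long the input is, the decoder only ever sees the single vector `context`.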

Attention

Fig 6a Encoder-Decoder architecture with attention

The attention layer solves this problem by taking the hidden states from all the encoder time steps and assigning a weight to each of them. These weighted states are passed to the decoder to predict y1. The decoder state S1 is then fed back to the attention layer along with the encoder hidden states. This time the attention layer assigns different weights to the hidden states, because it now has an idea of what the decoder has already predicted. This happens sequentially at the decoder end and removes the limitation of predicting the whole sequence from a single context vector.

The attention layer is essentially a fully connected layer that learns the attention weights during training. This mechanism has proven to be very powerful and lays the foundation for Transformers and their variants.
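
A minimal sketch of the mechanism, using simple dot-product scores in place of the learned fully connected scorer described above (the function name and sizes are illustrative assumptions):

    import numpy as np

    def attention(decoder_state, encoder_states):
        # Score every encoder hidden state against the current decoder state.
        scores = encoder_states @ decoder_state        # one score per encoder step
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                       # softmax -> attention weights
        context = weights @ encoder_states             # weighted sum of encoder states
        return context, weights

    rng = np.random.default_rng(0)
    encoder_states = rng.normal(size=(6, 8))   # hidden states from 6 encoder steps
    s1 = rng.normal(size=8)                    # current decoder state

    context, weights = attention(s1, encoder_states)

Because the weights are recomputed from the current decoder state at every step, each prediction gets its own freshly weighted view of the input sequence instead of one fixed context vector.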

We are in exciting times, with the amount of day-to-day research in AI growing constantly. Understanding the core concepts is essential for following new research areas and applications and for staying up to date. Thanks for reading!

Originally published at https://www.linkedin.com.
