**Recurrent Neural Networks (RNNs) are a state-of-the-art approach for sequential data and are used, among others, by Apple’s Siri and Google’s voice search. They were among the first algorithms to remember their input through an internal memory, which makes them well suited to Machine Learning problems that involve sequential data. They are one of the algorithms behind the achievements of Deep Learning in the past few years. In this post, you will learn the basic concepts of how Recurrent Neural Networks work, what their biggest issues are and how to solve them.**

**Table of Contents:**

- Introduction
- How they work
  - Feed-Forward Neural Networks
  - Recurrent Neural Networks
- Backpropagation Through Time
- Two issues of standard RNNs
  - Exploding Gradients
  - Vanishing Gradients
- Long Short-Term Memory
- Summary

**Introduction**

Recurrent Neural Networks (RNNs) are a powerful and robust type of neural network. They are among the most promising algorithms at the moment because, unlike feed-forward networks, they have an internal memory.

RNNs are relatively old, like many other deep learning algorithms. They were first created in the 1980s, but have only shown their real potential in recent years, thanks to the increase in available computational power, the massive amounts of data we have nowadays and the invention of LSTM in the 1990s.

Because of their internal memory, RNNs can remember important things about the input they received, which enables them to be very precise in predicting what’s coming next.

This is why they are a preferred algorithm for sequential data such as time series, speech, text, financial data, audio, video and weather: compared to many other algorithms, they can form a much deeper understanding of a sequence and its context.

*Recurrent Neural Networks produce predictive results in sequential data that other algorithms can’t.*

But when do you need to use a Recurrent Neural Network?

**“Whenever there is a sequence of data and that temporal dynamics that connects the data is more important than the spatial content of each individual frame.”**

**– Lex Fridman (MIT)**

Recurrent Neural Networks are showing up everywhere: they are used, for example, in the software behind Apple’s Siri and Google Translate.

**How they work**

We will first discuss some important facts about “normal” Feed-Forward Neural Networks that you need to know in order to understand Recurrent Neural Networks properly.

It is also important that you understand what sequential data is. It is simply ordered data in which related things follow each other. Examples are financial data or a DNA sequence. The most popular type of sequential data is perhaps time series data, which is a series of data points listed in time order.

**Feed-Forward Neural Networks**

RNNs and Feed-Forward Neural Networks are both named after the way they channel information.

In a Feed-Forward neural network, the information only moves in one direction, from the input layer, through the hidden layers, to the output layer. The information moves straight through the network. Because of that, the information never touches a node twice.

Feed-Forward Neural Networks have no memory of the input they received previously and are therefore bad at predicting what’s coming next. Because a feed-forward network only considers the current input, it has no notion of order in time. It simply can’t remember anything about what happened in the past, except what it learned during training.
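To make the one-directional flow concrete, here is a minimal sketch of a forward pass in plain Python. The layer sizes, the weights and the names `dense` and `feed_forward` are made up for illustration and do not come from any particular library. It shows that the output depends only on the current input, not on anything the network saw before.

```python
import math

def dense(inputs, weights, biases):
    """Fully connected layer with a tanh activation."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hypothetical fixed weights: 2 inputs -> 2 hidden units -> 1 output.
W_HIDDEN = [[0.5, -0.3], [0.1, 0.8]]
B_HIDDEN = [0.0, 0.1]
W_OUT = [[1.0, -1.0]]
B_OUT = [0.0]

def feed_forward(x):
    """Information moves input -> hidden -> output and never loops back."""
    hidden = dense(x, W_HIDDEN, B_HIDDEN)
    return dense(hidden, W_OUT, B_OUT)

first = feed_forward([1.0, 0.0])
_ = feed_forward([0.0, 1.0])      # an unrelated input in between
again = feed_forward([1.0, 0.0])  # same input as the first call
```

Here `first` and `again` are identical: the call in between leaves no trace, because the network keeps no state between inputs.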

**Recurrent Neural Networks**

In an RNN, the information cycles through a loop. When it makes a decision, it takes into consideration the current input and also what it has learned from the inputs it received previously.

The two images below illustrate the difference in the information flow between an RNN and a Feed-Forward Neural Network.

A standard RNN has a short-term memory. In combination with LSTM units, it also has a long-term memory, but we will discuss this further below.

Another good way to illustrate the concept of an RNN’s memory is to explain it with an example:

Imagine you have a normal feed-forward neural network and give it the word “neuron” as an input, and it processes the word character by character. By the time it reaches the character “r”, it has already forgotten about “n”, “e” and “u”, which makes it almost impossible for this type of neural network to predict what character comes next.

A Recurrent Neural Network is able to remember exactly that, because of its internal memory. It produces output, copies that output and loops it back into the network.

*Recurrent Neural Networks add the immediate past to the present.*

Therefore, a Recurrent Neural Network has two inputs: the present and the recent past. This is important because the sequence of data contains crucial information about what is coming next, which is why an RNN can do things other algorithms can’t.

A Feed-Forward Neural Network, like all other Deep Learning algorithms, assigns a weight matrix to its inputs and then produces the output. Note that RNNs apply weights to the current and also to the previous input. Furthermore, they tweak the weights for both through gradient descent and Backpropagation Through Time, which we will discuss in the next section.

Also note that while Feed-Forward Neural Networks map one input to one output, RNNs can map one to many, many to many (translation) and many to one (classifying a voice).
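The loop described above can be sketched with a single-unit RNN in plain Python. This is a simplified illustration, not how a framework implements it: the scalar weights `w_x` and `w_h`, the `tanh` activation and the function names are assumptions chosen to keep the example small.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """Combine the present input with the previous state (the recent past)."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the loop and return the state after every step."""
    h = 0.0  # empty memory before the first input
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# Two sequences that end with the same input but have different histories.
states_a = run_rnn([1.0, 0.0, 1.0])
states_b = run_rnn([0.0, 0.0, 1.0])
```

Although both sequences end with the same input, `states_a[-1]` and `states_b[-1]` differ, because the earlier inputs are still present in the state. Using every entry of `states` corresponds to a many-to-many mapping, while using only the last one corresponds to many-to-one.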

**Backpropagation Through Time**

To understand the concept of Backpropagation Through Time, you first have to understand the concepts of Forward and Back-Propagation. I will not go into the details here because that would be beyond the scope of this blog post, so I will try to give you a definition of these concepts that is as simple as possible but still allows you to understand the overall concept of backpropagation through time.

In neural networks, you do Forward-Propagation to get the output of your model and compare this output with the expected result to get the error.

Then you do Backward-Propagation, which is nothing but going backwards through your neural network to find the partial derivatives of the error with respect to the weights. These derivatives tell you how to adjust each weight: you subtract a small multiple of the derivative from the weight.

Those derivatives are then used by Gradient Descent, an algorithm that iteratively minimizes a given function. It adjusts the weights up or down, depending on which decreases the error. That is exactly how a Neural Network learns during the training process.

So with Backpropagation you try to tweak the weights of your model while training.
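The forward pass, the error, the derivative and the weight update can be seen together in a deliberately tiny example: a model with a single weight, trained on a single data point. The squared-error loss, the learning rate and the numbers are assumptions made for illustration.

```python
def forward(w, x):
    """Forward propagation for a model with one weight."""
    return w * x

def loss(w, x, target):
    """Squared error between the model output and the target."""
    return 0.5 * (forward(w, x) - target) ** 2

def grad(w, x, target):
    """Derivative of the loss with respect to w: (w*x - target) * x."""
    return (forward(w, x) - target) * x

w = 0.0                # initial weight
x, target = 2.0, 6.0   # one training example; the ideal weight is 3
learning_rate = 0.1
losses = []
for _ in range(50):
    losses.append(loss(w, x, target))
    # Gradient descent: move the weight against the derivative.
    w -= learning_rate * grad(w, x, target)
```

After the loop, `w` is very close to 3 and the entries of `losses` shrink from one step to the next, which is the learning process described above in miniature.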

The image below illustrates the concept of Forward Propagation and Backward Propagation using the example of a Feed-Forward Neural Network:

Backpropagation Through Time (BPTT) is basically just a fancy buzzword for doing Backpropagation on an unrolled Recurrent Neural Network. Unrolling is a visualization and conceptual tool that helps you understand what’s going on within the network. Most of the time, when you implement a Recurrent Neural Network in one of the common programming frameworks, the framework automatically takes care of the Backpropagation, but you still need to understand how it works so that you can troubleshoot problems that come up during development.

You can view an RNN as a sequence of Neural Networks that you train one after another with backpropagation.

The image below illustrates an unrolled RNN. On the left, you can see the RNN, which is unrolled after the equal sign. Note that there is no cycle after the equal sign, since the different timesteps are visualized and information gets passed from one timestep to the next. This illustration also shows why an RNN can be seen as a sequence of Neural Networks.

If you do Backpropagation Through Time, you need the conceptualization of unrolling, since the error of a given timestep depends on the previous timestep.

Within BPTT, the error is back-propagated from the last to the first timestep, while unrolling all the timesteps. This allows calculating the error for each timestep, which in turn allows updating the weights. Note that BPTT can be computationally expensive when you have a high number of timesteps.
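As a rough sketch of what a framework does for you, the following plain-Python code runs BPTT on a single-unit RNN. The scalar weights, the `tanh` activation and the choice to measure a squared error only at the last timestep are simplifying assumptions. The forward pass stores every state (the unrolled network) and the backward pass walks from the last timestep to the first.

```python
import math

def forward(xs, w_x, w_h):
    """Unroll the RNN; hs[t] is the state after t inputs, hs[0] is the start."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w_x * x + w_h * hs[-1]))
    return hs

def loss(xs, target, w_x, w_h):
    """Squared error on the final state only (an assumption for brevity)."""
    return 0.5 * (forward(xs, w_x, w_h)[-1] - target) ** 2

def bptt(xs, target, w_x, w_h):
    """Return (dL/dw_x, dL/dw_h) by walking the timesteps backwards."""
    hs = forward(xs, w_x, w_h)
    grad_w_x = grad_w_h = 0.0
    dh = hs[-1] - target                 # error signal at the last state
    for t in range(len(xs), 0, -1):
        dpre = dh * (1.0 - hs[t] ** 2)   # back through tanh at timestep t
        grad_w_x += dpre * xs[t - 1]     # the same weights are shared by
        grad_w_h += dpre * hs[t - 1]     # every timestep, so gradients add up
        dh = dpre * w_h                  # pass the error to the previous state
    return grad_w_x, grad_w_h

xs, target = [1.0, -0.5, 0.25], 0.3
grad_w_x, grad_w_h = bptt(xs, target, w_x=0.7, w_h=-0.4)

# Sanity check against a finite-difference estimate of dL/dw_h.
eps = 1e-6
numeric_w_h = (loss(xs, target, 0.7, -0.4 + eps)
               - loss(xs, target, 0.7, -0.4 - eps)) / (2 * eps)
```

The analytic `grad_w_h` and the finite-difference `numeric_w_h` should agree to several decimal places. The backward loop also shows why the cost grows with the number of timesteps: every stored state is visited once more on the way back.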

**Two issues of standard RNNs**

There are two major obstacles RNNs have or had to deal with. But to understand them, you first need to know what a gradient is.

A gradient is a partial derivative of a function with respect to its inputs. If you don’t know what that means, just think of it like this: a gradient measures how much the output of a function changes if you change the inputs a little bit.

You can also think of a gradient as the slope of a function. The higher the gradient, the steeper the slope and the faster a model can learn. But if the slope is zero, the model stops learning. In training, the gradient measures how much the error changes with respect to a change in each weight.
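The “change the input a little bit” intuition can be tried out directly. The sketch below estimates the slope of a made-up error function by nudging the weight slightly in both directions; the function `error` and the step size `eps` are assumptions for illustration.

```python
def numerical_gradient(f, x, eps=1e-6):
    """Estimate the slope of f at x from two nearby function values."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def error(w):
    """A hypothetical error curve with its minimum at w = 2."""
    return (w - 2.0) ** 2

steep = numerical_gradient(error, 5.0)  # far from the minimum
flat = numerical_gradient(error, 2.0)   # at the minimum
```

Far from the minimum the slope is large (close to 6 here), so a learning step moves the weight a lot. At the minimum the slope is close to 0 and the weight barely changes.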

#### Exploding Gradients

We speak of “Exploding Gradients” when the gradients grow extremely large, so that the algorithm makes huge weight updates without good reason and training becomes unstable. Fortunately, this problem can be solved fairly easily by truncating or squashing the gradients.
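One common way to squash gradients is to rescale them whenever their overall length exceeds a threshold, often called clipping by norm. The helper below is a minimal sketch of that idea; the function name and the threshold are illustrative and not taken from a specific framework.

```python
import math

def clip_by_norm(grads, max_norm):
    """Scale the gradient vector down if its Euclidean norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)   # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_norm([300.0, -400.0], max_norm=5.0)
untouched = clip_by_norm([0.3, 0.4], max_norm=5.0)
```

The first gradient has a norm of 500 and is scaled down to a norm of about 5 while keeping its direction. The second is already small and is returned as it is.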

#### Vanishing Gradients

We speak of “Vanishing Gradients” when the values of a gradient are too small and the model stops learning or takes far too long to learn because of that. This was a major problem in the 1990s and much harder to solve than exploding gradients. Fortunately, it was addressed through the concept of LSTM by Sepp Hochreiter and Juergen Schmidhuber, which we will discuss now.
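Both problems have the same origin, which a short calculation can illustrate. In a simple single-unit RNN, the error signal is multiplied at every timestep by roughly the recurrent weight times the slope of the activation function. The sketch below assumes that this factor is the same at every step, which is a simplification, and the numbers are made up.

```python
def backprop_factor(w_h, local_slope, timesteps):
    """Total factor applied to an error signal sent back `timesteps` steps."""
    factor = 1.0
    for _ in range(timesteps):
        factor *= w_h * local_slope
    return factor

# Per-step factor below 1: the signal shrinks towards zero.
vanishing = backprop_factor(w_h=0.9, local_slope=0.5, timesteps=50)
# Per-step factor above 1: the signal grows without bound.
exploding = backprop_factor(w_h=1.5, local_slope=1.0, timesteps=50)
```

After 50 steps, `vanishing` is many orders of magnitude smaller than 1 and `exploding` is many orders of magnitude larger. This is why early timesteps either receive almost no learning signal or a wildly oversized one.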

**Long Short-Term Memory**

Long Short-Term Memory (LSTM) networks are an extension of recurrent neural networks that basically extends their memory. They are therefore well suited to learning from important experiences that have very long time lags in between.

The units of an LSTM are used as building units for the layers of an RNN, which is then often called an LSTM network.

LSTMs enable RNNs to remember their inputs over a long period of time. This is because LSTMs keep their information in a memory that is much like the memory of a computer: the LSTM can read, write and delete information from its memory.

This memory can be seen as a gated cell, where gated means that the cell decides whether to store or delete information (i.e. whether it opens the gates or not), based on the importance it assigns to the information. The assigning of importance happens through weights, which are also learned by the algorithm. This simply means that it learns over time which information is important and which is not.

In an LSTM you have three gates: input, forget and output gate. These gates determine whether to let new input in (input gate), delete information because it isn’t important (forget gate) or let it impact the output at the current time step (output gate). You can see an illustration of an LSTM cell with its three gates below:

The gates in an LSTM are analog, in the form of sigmoids, meaning that they range from 0 to 1. The fact that they are analog makes them differentiable, which enables backpropagation through them.
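The three gates can be written down for a single-unit LSTM cell in plain Python. This is a sketch of the commonly used LSTM formulation with a separate cell state `c` and output `h`. The scalar parameters are hand-picked for illustration rather than learned, and the forget gate bias is set high on purpose so that the cell keeps what it stored.

```python
import math

def sigmoid(z):
    """Squash any value into the range 0..1, the 'analog' gate value."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps a name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))         # how much new information to write
    f = sigmoid(pre("forget"))        # how much of the old memory to keep
    o = sigmoid(pre("output"))        # how much of the memory to expose
    g = math.tanh(pre("candidate"))   # the new information itself
    c = f * c_prev + i * g            # update the memory cell
    h = o * math.tanh(c)              # output at this timestep
    return h, c

# Hypothetical parameters, not learned values.
params = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.0, 0.0, 5.0),   # large bias: gate stays nearly open
    "output": (0.3, 0.2, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = lstm_step(1.0, 0.0, 0.0, params)   # write something into memory
c_after_first = c
for _ in range(20):                       # then feed 20 empty inputs
    h, c = lstm_step(0.0, h, c, params)
```

Because the forget gate stays close to 1, most of `c_after_first` is still in `c` after 20 further steps. With a forget gate near 0, the same memory would be erased almost immediately.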

The problem of vanishing gradients is mitigated by LSTM because it keeps the gradients steep enough, which keeps training relatively short and accuracy high.

**Summary**

Now you have a proper understanding of how a Recurrent Neural Network works, which enables you to decide if it is the right algorithm to use for a given Machine Learning problem.

Specifically, you learned what the difference between a Feed-Forward Neural Network and an RNN is, when you should use a Recurrent Neural Network, how Backpropagation and Backpropagation Through Time work, what the main issues of an RNN are and how an LSTM works.
