Connectionist Temporal Classification (CTC) is a valuable technique for sequence problems where timing is variable, like Speech and Handwriting recognition. Without CTC, you would need an aligned dataset. In Speech Recognition, that means every character of a transcription would have to be aligned to its exact location in the audio file. CTC removes that requirement, which makes training such a system a lot easier.
Table of Contents:
- Recurrent Neural Networks
- Basic RNNs and Speech Recognition
- Connectionist Temporal Classification
Recurrent Neural Networks
This post requires knowledge about Recurrent Neural Networks. If you aren’t familiar with this kind of Neural Network, I encourage you to check out my article about them.
Nevertheless, I will give you a little recap:
RNNs are a widely used class of models for sequential data. They have an internal time-dependent state (memory) due to the so-called “feedback loop”. Because of that, they are good at predicting what’s coming next based on what happened before. This makes them well suited for problems that involve sequential data, like speech recognition and handwriting recognition.
To better understand RNNs, we have to look at what makes them different from a usual (feedforward) Neural Network. Take a look at the following example.
Imagine that we put the word “SAP” into a feedforward Neural Network (FFNN) and into a Recurrent Neural Network (RNN), and that each processes it one character after the other. By the time the FFNN reaches the letter “P”, it has already forgotten about “S” and “A”, but an RNN hasn’t. This is due to the different ways they process information, which the two images below illustrate.
In an FFNN, information only moves in one direction (from the input, through the hidden layers, to the output layer), so it never touches a node twice. This is why feedforward Neural Networks can’t remember earlier inputs: the only thing they retain is what they learned from the training data.
In contrast, an RNN cycles information through a loop. When making a decision, an RNN takes into consideration the current input and also what it has learned from the previous inputs.
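To make the difference concrete, here is a minimal NumPy sketch of the “SAP” example. The layer size, the random (untrained) weights and the helper names `ffnn_hidden` and `rnn_hidden` are assumptions made up for illustration; they only show how the feedback loop lets earlier characters influence the current state.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and random, untrained weights (illustration only).
vocab = "SAP"
hidden_size = 4
W_in = rng.normal(size=(hidden_size, len(vocab)))    # input -> hidden
W_rec = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden (feedback loop)


def one_hot(char):
    """Encode a character as a one-hot vector over the vocabulary."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(char)] = 1.0
    return vec


def ffnn_hidden(char):
    """A feedforward layer sees only the current input."""
    return np.tanh(W_in @ one_hot(char))


def rnn_hidden(sequence):
    """A recurrent layer also receives its own previous state."""
    h = np.zeros(hidden_size)
    for char in sequence:
        h = np.tanh(W_in @ one_hot(char) + W_rec @ h)
    return h


# The FFNN activation for "P" is the same no matter what came before it...
ffnn_after_sap = ffnn_hidden("P")
ffnn_after_aap = ffnn_hidden("P")

# ...while the RNN state after "P" depends on the earlier characters.
rnn_after_sap = rnn_hidden("SAP")
rnn_after_aap = rnn_hidden("AAP")

print(np.allclose(ffnn_after_sap, ffnn_after_aap))
print(np.allclose(rnn_after_sap, rnn_after_aap))
```

The only structural difference between the two functions is the `W_rec @ h` term, which is the feedback loop described above.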
Basic RNNs and Speech Recognition
Speech Recognition is the task of having a computer translate spoken language into text. You have an acoustic observation (a recording stored as an audio file) and you want a transcription of it without creating one manually.
You know that Recurrent Neural Networks are well suited for tasks that involve sequential data. Because speech is sequential data, you could in theory train an RNN on a dataset of acoustic observations and their corresponding transcriptions. But as you probably guessed, it isn’t that easy. Basic (also called canonical) Recurrent Neural Networks are trained with one target per input step, so they require aligned data. This means that each character (not each word!) needs to be aligned to its location in the audio file. Take a look at the image below and imagine that every character had to be aligned to its exact location.
Of course, this is tedious work that no one wants to do, and only a few organizations can afford to hire enough people for it, which is why very few datasets like this exist.
Connectionist Temporal Classification
Fortunately, we have Connectionist Temporal Classification (CTC), which is a way around not knowing the alignment between the input and the output. CTC is simply a loss function used to train Neural Networks, just like Cross-Entropy. It is used for problems where obtaining aligned data is an issue, like Speech Recognition.
As mentioned, CTC needs no aligned data. That is because it can assign a probability to any label, given an input. For training, it only requires an audio file as input and the corresponding transcription. But how can CTC assign a probability to any label, given just an input? It works by summing over the probabilities of all possible alignments between the input and the label, which is why it is called “alignment-free”. To understand that, take a look at this naive approach:
Here we have an input of size 9 whose correct transcription is “Iphone”. We force our system to assign an output character to each input step and then collapse the repeats, which results in the output. But this approach has two problems:
- It does not make sense to force every input step to be aligned to some output character, because we also need to account for silence within the input.
- There is no way to produce words that contain the same character twice in a row, like “Kaggle”. With this approach, we could only produce “Kagle” as output.
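The second problem is easy to reproduce. The sketch below implements the naive approach (one character per input step, then collapse the repeats). The function name `collapse_repeats` and the per-step character strings are made-up examples, not the output of a real network.

```python
from itertools import groupby


def collapse_repeats(frames):
    """Merge every run of identical characters into a single character."""
    return "".join(char for char, _ in groupby(frames))


# Nine hypothetical input steps, one character assigned to each.
naive_iphone = collapse_repeats("IIphhoone")
print(naive_iphone)  # Iphone

# The double "g" cannot survive the collapse, however many steps it spans.
naive_kaggle = collapse_repeats("KKaagggle")
print(naive_kaggle)  # Kagle
```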
There is a way around that, called the “blank token”. It does not stand for any character and simply gets removed before the final output is produced. Now the algorithm can assign either a character or the blank token to each input step. There is one rule: to allow double characters in the output, a blank token must be between them. With this simple rule, we can also produce output that has the same character twice in a row.
Here is an illustration of how this works:
1.) The CTC network assigns the character with the highest probability (or the blank token) to every input step.
2.) Repeats without a blank token in between get merged.
3.) Lastly, the blank tokens get removed.
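The three steps above can be sketched in a few lines of Python. Picking the most probable symbol at every step is usually called greedy (or best-path) decoding. The alphabet, the `-` blank symbol, the helper `peaked_probs` and the probability values are all assumptions for illustration; a real network would produce the probability matrix.

```python
from itertools import groupby

import numpy as np

BLANK = "-"  # assumed symbol for the blank token
alphabet = [BLANK, "K", "a", "g", "l", "e"]


def greedy_ctc_decode(probs, alphabet, blank=BLANK):
    """Decode a (time steps x symbols) probability matrix."""
    # 1.) Pick the most probable symbol (or the blank) at every input step.
    symbols = [alphabet[i] for i in probs.argmax(axis=1)]
    # 2.) Merge repeats that have no blank token in between.
    merged = [symbol for symbol, _ in groupby(symbols)]
    # 3.) Remove the blank tokens.
    return "".join(symbol for symbol in merged if symbol != blank)


def peaked_probs(path, alphabet, peak=0.9):
    """Hypothetical network output that strongly favours the given path."""
    rest = (1.0 - peak) / (len(alphabet) - 1)
    probs = np.full((len(path), len(alphabet)), rest)
    for t, symbol in enumerate(path):
        probs[t, alphabet.index(symbol)] = peak
    return probs


with_blank = greedy_ctc_decode(peaked_probs("KKag-gglee", alphabet), alphabet)
without_blank = greedy_ctc_decode(peaked_probs("KKagggglee", alphabet), alphabet)
print(with_blank)     # Kaggle
print(without_blank)  # Kagle
```

The blank between the two runs of “g” is what keeps them from being merged in step 2.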
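The CTC network can then give you the probability of a label for a given input. The probability of a single alignment is the product of the probabilities of its characters at each time-step, and the probability of the label is the sum of the probabilities of all alignments that collapse to that label.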
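For a tiny input you can compute this probability by brute force: enumerate every possible alignment, keep the ones that collapse to the label, and add up their probabilities. The sketch below does exactly that. The two-symbol alphabet, the `-` blank symbol and the probability values are made up, and the enumeration grows exponentially with the input length, so practical implementations use dynamic programming instead.

```python
from itertools import groupby, product

import numpy as np

BLANK = "-"  # assumed symbol for the blank token


def collapse(path, blank=BLANK):
    """Merge repeats, then drop the blank tokens."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(symbol for symbol in merged if symbol != blank)


def label_probability(probs, alphabet, label):
    """Sum the probabilities of all alignments that collapse to `label`."""
    n_steps = probs.shape[0]
    total = 0.0
    for path in product(range(len(alphabet)), repeat=n_steps):
        symbols = [alphabet[i] for i in path]
        if collapse(symbols) == label:
            # Probability of one alignment: product over the time steps.
            total += float(np.prod(probs[np.arange(n_steps), list(path)]))
    return total


# Hypothetical network output: 2 time steps, columns are [blank, "a"].
alphabet = [BLANK, "a"]
probs = np.array([[0.4, 0.6],
                  [0.5, 0.5]])

# Alignments for "a": "a-", "-a" and "aa" -> 0.3 + 0.2 + 0.3
p_a = label_probability(probs, alphabet, "a")
print(p_a)  # approximately 0.8
```

The CTC loss for a training example is then the negative logarithm of this probability, so minimizing the loss makes the correct transcription more likely.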
In this post, we’ve reviewed the main facts about Recurrent Neural Networks and learned why canonical RNNs aren’t well suited to sequence problems when no aligned data is available. Then we discussed what Connectionist Temporal Classification is, how it works and why it enables RNNs to tackle tasks like Speech Recognition.