Recurrent Neural Networks (RNNs) quickly became the go-to neural network architecture for Natural Language Processing (NLP) tasks. In this blog post, I’ll start with a broad definition of their architecture, and then explain what makes them so popular with the NLP community. Finally, I’ll list a collection of blog posts, tutorials, research papers, and frequently asked questions to help you discover the different flavours of RNNs.
Over the course of the last few years, recurrent architectures for neural networks have established themselves as state-of-the-art in several NLP tasks, ranging from Named Entity Recognition[1] to Language Modeling[2] and Machine Translation[3].
This successful breakthrough comes a long time after the first proposition of this kind of architecture, around 30 years ago[4] (20 years ago for modern architectures[5]).
I woke up at 4:20 AM to check on my RNN. It’s like having a Tamagotchi all over again. pic.twitter.com/ZkDPGp37oR
— Linda Liukas (@lindaliukas) September 18, 2017
The main advantage of RNNs resides in their ability to deal with sequential data, thanks to their “memory”. Whereas Artificial Neural Networks (ANNs) have no notion of time, and the only input they consider is the current example they are being fed, RNNs consider both the current input and a “context unit” built upon what they’ve seen previously.
So the prediction made by the network at timestep T is influenced by the one it made at timestep T – 1. And when you think about it, that’s pretty much what we do as humans: we use our previous experience (T – 1) to handle new and unseen things (T).
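To make the “context unit” idea concrete, here is a minimal sketch of a vanilla RNN cell in numpy (the weights and dimensions are made up for illustration): the same step function is applied at every timestep, and the hidden state h carries what the network has seen so far into the next step.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One timestep: the new hidden state mixes the current
    input x_t with the previous context h_prev."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W_xh = rng.normal(size=(input_dim, hidden_dim)) * 0.1
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)

# A sequence of 5 inputs; the hidden state is threaded through time,
# so the state at step T depends on the state at step T - 1.
xs = rng.normal(size=(5, input_dim))
h = np.zeros(hidden_dim)  # no context before the first timestep
for x_t in xs:
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)

print(h.shape)  # (3,)
```

Because h is threaded through the loop, feeding the same inputs in a different order produces a different final state, which is exactly the notion of memory that plain feed-forward networks lack.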
Christopher Olah puts it very nicely in his blog post[6]:
"As you read this essay, you understand each word based on your understanding of previous words. You don’t throw everything away and start thinking from scratch again."
And luckily for us, NLP is full of sequential (or temporal) data. Be it sentences, words, or characters, we always use the context to establish a more precise meaning for communication, whether it is written or oral.
Here are a few examples:
- In Machine Translation, a word will carry different meanings depending on the context.
- Sentiment Analysis will detect modifiers (like “very”, “not”, and “a bit too”) to grasp the intensity, polarity, or negation of a sentiment.
- In Dialog Management, the next step of a conversation is conditioned by the previous interactions and the goal given to the system.
- For Tokenization, we can use the next and previous characters to decide whether or not a new word is beginning.
And it doesn’t stop there: Part of Speech Tagging, Sentence Segmentation, Language Modeling, Semantic Role Labelling, Text Summarization, Spell Checking, and a whole lot of other tasks rely on the sequential nature of the data.
But RNNs are not perfect yet: since each timestep’s computation needs the result of the previous one, they are slow to train and computationally expensive. Today, more and more researchers are using Convolutional Neural Networks (CNNs), because they offer speed and accuracy improvements in many tasks.
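A small sketch of why this matters (toy numpy code, with made-up shapes and weights): the recurrent pass below must run a Python loop because step t reads the output of step t − 1, whereas a convolution over the same sequence only reads fixed windows of the input, so every output position can be computed at once.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 8, 3                      # sequence length, feature size
xs = rng.normal(size=(T, d))

# Recurrent: step t cannot start before step t - 1 has finished,
# so the T steps must run one after another.
W = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
hs = []
for x_t in xs:                   # inherently sequential loop
    h = np.tanh(x_t @ W + h)
    hs.append(h)

# Convolutional: every output position only reads a fixed window
# of the *input*, so all T positions can be computed together
# (one vectorized expression, no loop over time).
k = rng.normal(size=(3,))                 # width-3 filter, applied per feature
padded = np.pad(xs, ((2, 0), (0, 0)))     # causal padding on the left
conv = sum(k[i] * padded[i:i + T] for i in range(3))

print(len(hs), conv.shape)  # 8 (8, 3)
```

This sequential dependency is exactly what GPU-friendly architectures like Quasi-Recurrent Neural Networks (listed in the papers below) try to remove.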
Still, the saying that “an LSTM with an attention layer will yield state-of-the-art results on any task” is not to be forgotten, and recurrent architectures will keep populating user-facing NLP systems and benchmark baselines for a long time.
- Recurrent Neural Networks Tutorial Part 1: Introduction to RNNs, Denny Britz, 2015
- Understanding LSTMs, Christopher Olah, 2015
- Recurrent Neural Networks & LSTMs, Rohan Kapur, 2017
- Evolution: from vanilla RNN to GRU & LSTMs, Maxim Kolomeychenko, Yuri Borisov, 2017
- Natural Language Processing in Artificial Intelligence is almost human-level accurate, Rafal Karczewski, 2017
- A Beginner’s Guide to Recurrent Networks and LSTMs, DL4J, unknown
- The Unreasonable Effectiveness of Recurrent Neural Networks, Andrej Karpathy, 2015
- The unreasonable effectiveness of Character-level Language Models, Yoav Goldberg, 2015
- Attention and Augmented Recurrent Neural Networks, Christopher Olah, Shan Carter, 2016
- Backpropagating an LSTM: A Numerical Example, Aidan Gomez, 2016
- Written Memories: Understanding, Deriving and Extending the LSTM, R2RT, 2016
- Non-Zero Initial States for Recurrent Neural Networks, R2RT, 2016
- Attention Mechanism, Leonard Blier, 2016
- Exploring LSTMs, Edwin Chen, 2017
- Interpreting neurons in an LSTM network, Tigran Galstyan, Hrant Khachatrian, 2017
- Deep Learning for NLP Best Practices, Sebastian Ruder, 2017
- LSTM Forward and Backward Pass, Arun Mallya, unknown
- LSTM implementation explained, Adam Paszke, 2015
- Recurrent Neural Networks Tutorial, Denny Britz, 2015
- Generating Constitution with recurrent neural networks, Narek Hovsepyan, Hrant Khachatrian, 2015
- How to build a Recurrent Neural Network in TensorFlow, Erik Hallström, 2016
- Recurrent Neural Networks in Tensorflow I, R2RT, 2016
- Automatic transliteration with LSTM, Tigran Galstyan et al., 2016
- RNNs in Tensorflow, a Practical Guide and Undocumented Features, Denny Britz, 2016
- Recurrent Neural Networks, TensorFlow, 2017
- How to implement a recurrent neural network, Peter Roelants, unknown
- Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, Junyoung Chung et al., 2014
- LSTM: A Search Space Odyssey, Klaus Greff et al., 2015
- A Critical Review of Recurrent Neural Networks for Sequence Learning, Zachary C. Lipton et al., 2015
- An Empirical Exploration of Recurrent Network Architectures, Rafal Jozefowicz et al., 2015
- Visualizing and Understanding Recurrent Networks, Andrej Karpathy et al., 2015
- A survey on the application of recurrent neural networks to statistical language modeling, Wim De Mulder et al., 2015
- Long Short-Term Memory in Recurrent Neural Networks, Felix Gers, 2001
- Supervised Sequence Labelling with Recurrent Neural Networks, Alex Graves, 2008
- Statistical Language Models Based on Neural Networks, Tomas Mikolov, 2012
- Training Recurrent Neural Networks, Ilya Sutskever, 2013
- Recursive Deep Learning for Natural Language Processing and Computer Vision, Richard Socher, 2014
- Neural networks and physical systems with emergent collective computational abilities, John Hopfield, 1982
- Serial order: A parallel distributed processing approach, Michael Jordan, 1986
- Finding structure in time, Jeffrey Elman, 1990
- Backpropagation through time: what it does and how to do it, Paul Werbos, 1990
- Learning Long-Term Dependencies with Gradient Descent is Difficult, Yoshua Bengio et al., 1994
- Long Short-Term Memory, Sepp Hochreiter, Jürgen Schmidhuber, 1997
- Bidirectional Recurrent Neural Networks, Mike Schuster, Kuldip K. Paliwal, 1997
- Multi-Dimensional Recurrent Neural Networks, Alex Graves et al., 2007
- Parsing Natural Scenes and Natural Language with Recursive Neural Networks, Richard Socher et al., 2011
- On the difficulty of training Recurrent Neural Networks, Razvan Pascanu et al., 2012
- LSTM Neural Networks for Language Modeling, Martin Sundermeyer et al., 2012
- Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Kyunghyun Cho et al., 2014
- On the Properties of Neural Machine Translation: Encoder–Decoder Approaches, Kyunghyun Cho et al., 2014
- Sequence to Sequence Learning with Neural Networks, Ilya Sutskever et al., 2014
- Regularizing RNNs by Stabilizing Activations, David Krueger, Roland Memisevic, 2015
- Neural Responding Machine for Short-Text Conversation, Lifeng Shang et al., 2015
- A Neural Conversational Model, Oriol Vinyals, Quoc Le, 2015
- Top-down Tree Long Short-Term Memory Networks, Xingxing Zhang et al., 2015
- Attention with Intention for a Neural Network Conversation Model, Kaisheng Yao et al., 2015
- Bidirectional LSTM-CRF Models for Sequence Tagging, Zhiheng Huang et al., 2015
- Quasi-Recurrent Neural Networks, James Bradbury et al., 2016
- A Neural Knowledge Language Model, Sungjin Ahn et al., 2016
- Contextual LSTM (CLSTM) models for Large scale NLP tasks, Shalini Ghosh et al., 2016
- Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, Yonghui Wu et al., 2016
- Recurrent Highway Networks, Julian Georg Zilly et al., 2016
- Recurrent Memory Array Structures, Kamil Rocki, 2016
- Hierarchical Multiscale Recurrent Neural Networks, Junyoung Chung et al., 2016
- Smart Reply: Automated Response Suggestion for Email, Anjuli Kannan et al., 2016
- Dialog-based Language Learning, Jason Weston, 2016
- Learning End-to-End Goal-Oriented Dialog, Antoine Bordes et al., 2016
- Massive Exploration of Neural Machine Translation Architectures, Denny Britz et al., 2017
- Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks, Nils Reimers, Iryna Gurevych, 2017
- Regularizing and Optimizing LSTM Language Models, Stephen Merity et al., 2017
- Training RNNs as Fast as CNNs, Tao Lei, Yu Zhang, 2017
- How are recurrent neural networks different from convolutional neural networks?
- What is the difference between Recurrent Neural Networks and Recursive Neural Networks?
- What is the difference between LSTM and GRU for RNNs?
- How to select the number of hidden layers/hidden units in an LSTM?
- What’s so great about LSTMs?
- What is masking in a Recurrent Neural Network?
- What is the attention mechanism introduced in RNNs?
- Is LSTM Turing complete?
- When should one decide to use an LSTM in a Neural Network?
- Why doesn’t the LSTM forget gate cause a vanishing/dying gradient?
- What is the difference between states and outputs in an LSTM?
- Is it possible to do online learning with LSTMs?
Also published on Medium.
References
1. Zhiheng Huang et al., Bidirectional LSTM-CRF Models for Sequence Tagging, 2015
2. Stephen Merity et al., Regularizing and Optimizing LSTM Language Models, 2017
3. Yonghui Wu et al., Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016
4. John Hopfield, Neural networks and physical systems with emergent collective computational abilities, 1982
5. Sepp Hochreiter, Jürgen Schmidhuber, Long Short-Term Memory, 1997
6. Christopher Olah, Understanding LSTMs, 2015