Long Short-Term Memory (LSTM) is a Recurrent Neural Network (RNN) architecture designed to model temporal sequences (e.g. audio, sentences, video) and their long-range dependencies better than conventional RNNs [1]. There is a lot of excitement in the machine learning community around LSTMs (and DeepMind's counterpart, "Neural Turing Machines" [2], or Facebook's "Memory Networks" [3]), as they overcome a fundamental limitation of conventional RNNs, the vanishing gradient problem, and achieve state-of-the-art benchmark performance on a number of tasks [4,5] (a sketch of the core mechanism follows the list):
- Text-to-speech synthesis (Fan et al., Microsoft, Interspeech 2014)
- Language identification (Gonzalez-Dominguez et al., Google, Interspeech 2014)
- Large vocabulary speech recognition (Sak et al., Google, Interspeech 2014)
- Prosody contour prediction (Fernandez et al., IBM, Interspeech 2014)
- Medium vocabulary speech recognition (Geiger et al., Interspeech 2014)
- English to French translation (Sutskever et al., Google, NIPS 2014)
- Audio onset detection (Marchi et al., ICASSP 2014)
- Social signal classification (Brueckner & Schuller, ICASSP 2014)
- Arabic handwriting recognition (Bluche et al., DAS 2014)
- TIMIT phoneme recognition (Graves et al., ICASSP 2013)
- Optical character recognition (Breuel et al., ICDAR 2013)
- Image caption generation (Vinyals et al., Google, 2014)
- Video to textual description (Donahue et al., 2014)
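To make the mechanism concrete, here is a minimal NumPy sketch of a single LSTM time step, following the standard gate equations of [1]. The names (`lstm_step`, `W`, `U`, `b`) are illustrative, not taken from any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W (4n x d), U (4n x n), and b (4n,) stack the parameters of the
    four gates -- input, forget, cell candidate, output -- row-wise.
    """
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b       # pre-activations for all four gates
    i = sigmoid(z[0*n:1*n])          # input gate: how much new info to write
    f = sigmoid(z[1*n:2*n])          # forget gate: how much old state to keep
    g = np.tanh(z[2*n:3*n])          # candidate cell update
    o = sigmoid(z[3*n:4*n])          # output gate: how much state to expose
    c = f * c_prev + i * g           # additive cell-state update
    h = o * np.tanh(c)               # hidden state passed to the next step
    return h, c

# Toy usage: unroll over a short random sequence.
n, d = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * n, d))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
for x in rng.normal(size=(5, d)):
    h, c = lstm_step(x, h, c, W, U, b)
```

The additive update of the cell state `c` is the key design choice: gradients can flow through it largely unattenuated across time steps, which is what lets LSTMs capture long-range dependencies where conventional RNNs suffer from vanishing gradients.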
The current dynamic state …