caffe - http://archive.pkmital.com

Long Short-Term Memory (LSTM) is a Recurrent Neural Network (RNN) architecture designed to model temporal sequences (e.g. audio, sentences, video) and their long-range dependencies better than conventional RNNs [1]. There is a lot of excitement in the machine learning community about LSTMs (and DeepMind’s counterpart, “Neural Turing Machines” [2], or Facebook’s “Memory Networks” [3]) because they overcome a fundamental limitation of conventional RNNs and achieve state-of-the-art benchmark performance on a number of tasks [4,5]:

Text-to-speech synthesis (Fan et al., Microsoft, Interspeech 2014)
Language identification (Gonzalez-Dominguez et al., Google, Interspeech 2014)
Large vocabulary speech recognition (Sak et al., Google, Interspeech 2014)
Prosody contour prediction (Fernandez et al., IBM, Interspeech 2014)
Medium vocabulary speech recognition (Geiger et al., Interspeech 2014)
English to French translation (Sutskever et al., Google, NIPS 2014)
Audio onset detection (Marchi et al., ICASSP 2014)
Social signal classification (Brueckner & Schuller, ICASSP 2014)
Arabic handwriting recognition (Bluche et al., DAS 2014)
TIMIT phoneme recognition (Graves et al., ICASSP 2013)
Optical character recognition (Breuel et al., ICDAR 2013)
Image caption generation (Vinyals et al., Google, 2014)
Video to textual description (Donahue et al., 2014)
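To make the architecture a little more concrete, here is a minimal sketch of a single LSTM time step in plain Python, using the common formulation with input, forget and output gates. The function names, the `zero_params` helper and the toy sizes are illustrative assumptions, not taken from the cited papers or from Caffe. The demo only shows the mechanism that lets a cell hold a value over many steps: when the forget gate saturates near 1, the cell state is carried forward almost unchanged.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, U, b, x, h):
    """Return W.x + U.h + b, with matrices given as lists of rows."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + b_i
        for w_row, u_row, b_i in zip(W, U, b)
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step. params maps 'i', 'f', 'o', 'g' to (W, U, b)."""
    pre = {name: affine(*params[name], x, h_prev) for name in ("i", "f", "o", "g")}
    i = [sigmoid(v) for v in pre["i"]]    # input gate
    f = [sigmoid(v) for v in pre["f"]]    # forget gate
    o = [sigmoid(v) for v in pre["o"]]    # output gate
    g = [math.tanh(v) for v in pre["g"]]  # candidate cell update
    # The cell state is a gated sum of the old state and the new candidate.
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

def zero_params(n_in, n_hidden, forget_bias=0.0):
    """Hypothetical helper: all-zero weights, optional bias on the forget gate."""
    def gate(bias):
        W = [[0.0] * n_in for _ in range(n_hidden)]
        U = [[0.0] * n_hidden for _ in range(n_hidden)]
        return W, U, [bias] * n_hidden
    return {"i": gate(0.0), "f": gate(forget_bias), "o": gate(0.0), "g": gate(0.0)}

def run(forget_bias, steps=100):
    """Feed zeros for `steps` time steps and return the final cell state."""
    params = zero_params(n_in=1, n_hidden=1, forget_bias=forget_bias)
    h, c = [0.0], [1.0]
    for _ in range(steps):
        h, c = lstm_step([0.0], h, c, params)
    return c[0]

print(run(forget_bias=20.0))  # forget gate near 1: state stays close to 1.0
print(run(forget_bias=0.0))   # forget gate at 0.5: state decays toward 0
```

In a trained network the gates depend on the input and the previous hidden state, so the cell learns when to keep, overwrite or expose its contents rather than using fixed biases as in this toy setup.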

The current dynamic state … Continue reading...

I’ve spent a little time with Caffe over the holiday break to try to understand how it might work in the context of real-time visualization and object recognition in more natural scenes and videos. So far, I’ve implemented the following deep convolutional networks using the 1280×720 resolution webcam on my 2014 MacBook Pro:

VGG ILSVRC 2014 (16 Layers): 1000 ImageNet Object Categories (~7 FPS)
VGG ILSVRC 2014 (19 Layers): 1000 Object Categories (~5 FPS)
BVLC GoogLeNet: 1000 Object Categories (~24 FPS)
Region-CNN ILSVRC 2013: 200 Object Categories (~22 FPS)
BVLC Reference CaffeNet: 1000 Object Categories (~18 FPS)
BVLC Reference CaffeNet (Fully Convolutional) 8×8: 1000 Object Categories (~12 FPS)
BVLC Reference CaffeNet (Fully Convolutional) 34×17: 1000 Object Categories (~1 FPS)
MIT Places-CNN Hybrid (Places + ImageNet): 971 Object Categories + 200 Scene Categories = 1171 Categories (~12 FPS)
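The two fully convolutional grid sizes in the list can be sanity-checked by walking an input size through the network's layers. The sketch below assumes the layer geometry of the published BVLC reference CaffeNet (kernel, stride and padding per layer, with fc6 recast as a 6×6 convolution) and Caffe's convention of rounding down for convolutions and up for pooling. The post does not say which input sizes were used, so the 451×451 and 1280×720 inputs are assumptions that happen to reproduce the 8×8 and 34×17 grids.

```python
import math

# (name, kind, kernel, stride, pad), assumed from the reference CaffeNet definition.
# LRN layers and the 1x1 fc7/fc8 convolutions are omitted: they keep the size unchanged.
CAFFENET_FULL_CONV = [
    ("conv1", "conv", 11, 4, 0),
    ("pool1", "pool", 3, 2, 0),
    ("conv2", "conv", 5, 1, 2),
    ("pool2", "pool", 3, 2, 0),
    ("conv3", "conv", 3, 1, 1),
    ("conv4", "conv", 3, 1, 1),
    ("conv5", "conv", 3, 1, 1),
    ("pool5", "pool", 3, 2, 0),
    ("fc6-conv", "conv", 6, 1, 0),
]

def out_size(n, kind, kernel, stride, pad):
    """Output length along one axis for a single layer."""
    span = n + 2 * pad - kernel
    if kind == "conv":
        return span // stride + 1          # convolution rounds down
    return math.ceil(span / stride) + 1    # pooling rounds up

def grid_size(width, height, layers=CAFFENET_FULL_CONV):
    """Size of the output classification grid for a given input size."""
    for _name, kind, kernel, stride, pad in layers:
        width = out_size(width, kind, kernel, stride, pad)
        height = out_size(height, kind, kernel, stride, pad)
    return width, height

print(grid_size(227, 227))   # (1, 1): the standard single-crop classifier
print(grid_size(451, 451))   # (8, 8)
print(grid_size(1280, 720))  # (34, 17)
```

Under these assumptions the 34×17 grid corresponds to running the network over the full webcam frame, which would be consistent with its much lower frame rate.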

The above image depicts the output of an 8×8 grid detection, with brighter regions indicating higher probabilities of the class “snorkel” (which the network automatically selected from 1000 possible classes as the most probable).
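A rough sketch of how such a visualization could be produced from the grid output: pick one class for the whole frame, then map that class's probability in each cell to a brightness value. The selection rule (the class with the highest probability in any single cell) and the linear 0 to 255 mapping are assumptions for illustration; the post does not spell out either, and `top_class_heatmap` is a hypothetical name.

```python
def top_class_heatmap(grid_probs):
    """grid_probs[row][col] is a list of class probabilities for that cell.

    Returns (class_index, heat), where heat[row][col] is a 0-255 brightness.
    """
    n_classes = len(grid_probs[0][0])

    # Assumed rule: choose the class whose best cell has the highest probability.
    def peak(k):
        return max(cell[k] for row in grid_probs for cell in row)

    best = max(range(n_classes), key=peak)

    # Brighter pixel = higher probability of the chosen class in that cell.
    heat = [[int(round(255 * cell[best])) for cell in row] for row in grid_probs]
    return best, heat

# Toy 2x2 grid with 3 classes, standing in for 8x8 cells with 1000 classes.
grid = [
    [[0.5, 0.25, 0.25], [0.125, 0.125, 0.75]],
    [[0.25, 0.5, 0.25], [0.25, 0.25, 0.5]],
]
best, heat = top_class_heatmap(grid)
print(best)  # 2
print(heat)  # [[64, 191], [64, 128]]
```

The resulting grid can then be upscaled to the frame size and blended over the camera image.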

So far I have spent some time understanding how Caffe keeps each layer’s data during a forward/backward pass, and how the deeper layers could be “visualized” in a … Continue reading...

Archived entries for caffe

Handwriting Recognition with LSTMs and ofxCaffe

Real-Time Object Recognition with ofxCaffe