Archived entries for computer vision

Toolkit for Visualizing Eye-Movements and Processing Audio/Video


Original video still without eye-movements and heatmap overlay copyright Dropping Knowledge Video Republic.

From 2008 to 2010, I worked on the Dynamic Images and Eye-Movements (D.I.E.M.) project, led by John Henderson, with Tim Smith and Robin Hill. Together we collected nearly 200 participants’ eye-movements on nearly 100 short films ranging from 30 seconds to 5 minutes in length. The database is freely available and covers a wide range of film styles, from advertisements to movie and music trailers to news clips. During my time on the project, I developed an open source toolkit to complement D.I.E.M. called C.A.R.P.E., or Computational Algorithmic Representation and Processing of Eye-movements (Tim’s idea!), for visualizing and processing the data we collected, and used it in writing up a journal paper describing a strong correlation between tightly clustered eye-movements and the motion in a scene. We also published visualizations of our entire corpus on our Vimeo channel. The project came to a halt, and so did the visualization software. I’ve since picked up the ball and re-written it entirely from the ground up.

The image below shows how you can represent the movie, the motion in the scene of the movie (represented in … Continue reading...

Handwriting Recognition with LSTMs and ofxCaffe

Long Short-Term Memory (LSTM) is a Recurrent Neural Network (RNN) architecture designed to model temporal sequences (e.g. audio, sentences, video) and long-range dependencies better than conventional RNNs [1]. There is a lot of excitement in the machine learning communities around LSTMs (and DeepMind’s counterpart, “Neural Turing Machines” [2], or Facebook’s “Memory Networks” [3]), as they overcome a fundamental limitation of conventional RNNs and are able to achieve state-of-the-art benchmark performance on a number of tasks [4,5] (a sketch of the standard LSTM cell equations follows the list):

  • Text-to-speech synthesis (Fan et al., Microsoft, Interspeech 2014)
  • Language identification (Gonzalez-Dominguez et al., Google, Interspeech 2014)
  • Large vocabulary speech recognition (Sak et al., Google, Interspeech 2014)
  • Prosody contour prediction (Fernandez et al., IBM, Interspeech 2014)
  • Medium vocabulary speech recognition (Geiger et al., Interspeech 2014)
  • English to French translation (Sutskever et al., Google, NIPS 2014)
  • Audio onset detection (Marchi et al., ICASSP 2014)
  • Social signal classification (Brueckner & Schulter, ICASSP 2014)
  • Arabic handwriting recognition (Bluche et al., DAS 2014)
  • TIMIT phoneme recognition (Graves et al., ICASSP 2013)
  • Optical character recognition (Breuel et al., ICDAR 2013)
  • Image caption generation (Vinyals et al., Google, 2014)
  • Video to textual description (Donahue et al., 2014)
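
For reference, here is a minimal sketch of the standard LSTM cell equations (the generic formulation with forget gates, not taken from any particular paper above). The input, forget, and output gates i_t, f_t, o_t control what enters, what stays in, and what leaves the memory cell c_t, which is what lets gradients survive over long time lags:

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)
\tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
h_t = o_t \odot \tanh(c_t)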

The current dynamic state … Continue reading...

Real-Time Object Recognition with ofxCaffe

Screen Shot 2015-01-03 at 12.57.23 PM

I’ve spent a little time with Caffe over the holiday break to try to understand how it might work in the context of real-time visualization/object recognition in more natural scenes/videos. Right now, I’ve implemented the following deep convolutional networks using the 1280×720 webcam on my 2014 MacBook Pro:

The above image depicts the output of an 8×8 grid detection, where brighter regions indicate higher probabilities of the class “snorkel” (automatically selected by the network as the highest-probability class out of 1000 possible classes).
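
To give a feel for the post-processing, here is a rough C++ sketch of picking the highest-probability class from a softmax output and normalizing an 8×8 grid of per-cell probabilities for display as a heat map; the function names and data layout are assumptions for illustration, not ofxCaffe’s actual API:

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Index of the highest-probability class in a softmax output
// (e.g. 1000 ImageNet classes).
std::size_t argmaxClass(const std::vector<float>& probs) {
    return std::distance(probs.begin(),
                         std::max_element(probs.begin(), probs.end()));
}

// Given per-cell probabilities of one class over an 8x8 grid (row-major,
// 64 values), scale them to [0, 1] so brighter cells mean higher probability.
std::vector<float> normalizeGrid(std::vector<float> grid) {
    const float maxP = *std::max_element(grid.begin(), grid.end());
    if (maxP > 0.0f)
        for (float& p : grid) p /= maxP;
    return grid;
}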

So far I have spent some time understanding how Caffe keeps each layer’s data during a forward/backward pass, and how the deeper layers could be “visualized” in a … Continue reading...

YouTube’s “Copyright School” Smash Up

Ever wonder what happens when you’ve been accused of violating copyright multiple times on YouTube? First, you get redirected to YouTube’s “Copyright School” whenever you visit YouTube, forcing you to watch a Happy Tree Friends cartoon where the main character is dressed as an actual pirate:

Second, I’m guessing, your account will be banned. Third, you cry and wonder why you ever violated copyright in the first place.

In my case, I’ve disputed every one of the 4 copyright violation notices that I’ve received on grounds of Fair Use and Fair Dealing. Here’s what happens when you file a dispute using YouTube’s online form (click for high-res):






3 of the 4 were dropped after I filed disputes, though I’m still waiting to hear the response to the above dispute. Read the dispute letter to Sony ATV and UMPG Publishers in full here.

The picture above shows a few stills from what my Smash Ups look like. The process, described in greater detail on createdigitalmotion.com, is part of my ongoing research into how existing content can be transformed into artistic styles reminiscent of analytic cubist, figurative, and futurist paintings. The process to create the videos … Continue reading...

An open letter to Sony ATV and UMPG

Dear Sony ATV Publishing, UMPG Publishing, and other concerned parties,

I ask you to please withdraw your copyright violation notice on my video, “PSY – GANGNAM STYLE (강남스타일) M/V (YouTube SmashUp)”, as I believe my use of any copyrighted material is protected under Fair Use or Fair Dealing. This video was created by an automated process as part of an art project developed during my PhD at Goldsmiths, University of London: http://archive.pkmital.com/projects/visual-smash-up/ and http://archive.pkmital.com/projects/youtube-smash-up/

The process which creates the audio and video is entirely automated, meaning the accused video is created by an algorithm. This algorithm begins by creating a large database of tiny fragments of audio and video (less than 1 second of audio per fragment) using 9 videos from YouTube’s top 10 list. In this database, the tiny fragments of video and audio are stored as unrelated pieces of information and described only by a short series of 10-15 numbers. These numbers represent low-level features describing the texture and shape of the fragment of audio or video. These tiny fragments are then matched to the tiny fragments of audio and video detected within the target for resynthesis, in this case the number one YouTube video … Continue reading...

Copyright Violation Notice from “Rightster”

I’ve been working on an art project which takes the top 10 videos on YouTube and tries to resynthesize the #1 video using the remaining 9 videos. The computational model is based on low-level human perception and uses only very abstract features such as edges, textures, and loudness. I’ve created a new synthesis each week using the top 10 of that week in the hopes that, one day, I will be able to resynthesize my own video into the top 10. It is essentially a viral algorithm, though whether it will succeed remains to be seen.

The database of content used in the recreation of the above video comes from the following videos:
#2 News Anchor FAIL Compilation 2012 || PC
#3 Flo Rida – Whistle [Official Video]
#4 Carly Rae Jepsen – Call Me Maybe
#5 Jennifer Lopez – Goin’ In ft. Flo Rida
#6 Taylor Swift – We Are Never Ever Getting Back Together
#7 will.i.am – This Is Love ft. Eva Simons
#8 Call Me Maybe – Carly Rae Jepsen (Chatroulette Version)
#9 Justin Bieber – As Long As You Love Me ft. Big Sean
#10 Rihanna – Where Have You Been

It … Continue reading...

Concatenative Video Synthesis (or Video Mosaicing)


Working closely with my adviser Mick Grierson, I have developed a way to resynthesize existing videos using material from another set of videos. This process starts by learning a database of objects that appear in the set of videos to synthesize from. The target video to resynthesize is then broken into objects in a similar manner, and each of its objects is matched to objects in the database. What you get is a resynthesis of the video that appears as beautiful disorder. Here are two examples: the first uses Family Guy to resynthesize The Simpsons, and the second uses Jan Svankmajer’s Food to resynthesize his Dimensions of Dialogue.
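
A minimal sketch of the matching idea (illustrative only, not the actual implementation), assuming each database object and each target object has already been reduced to a short feature vector:

#include <cstddef>
#include <limits>
#include <vector>

// Each object is described only by a short feature vector (e.g. 10-15 numbers).
using Feature = std::vector<float>;

// Squared Euclidean distance between two feature vectors of equal length.
float squaredDistance(const Feature& a, const Feature& b) {
    float d = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float diff = a[i] - b[i];
        d += diff * diff;
    }
    return d;
}

// Brute-force nearest neighbour: the index of the database object whose
// features best match the target object.
std::size_t bestMatch(const Feature& target, const std::vector<Feature>& database) {
    std::size_t best = 0;
    float bestDist = std::numeric_limits<float>::max();
    for (std::size_t i = 0; i < database.size(); ++i) {
        const float d = squaredDistance(target, database[i]);
        if (d < bestDist) { bestDist = d; best = i; }
    }
    return best;
}

Each object in the target is replaced by its best match from the database, which is where the beautiful disorder comes from.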

Continue reading...

Facial Appearance Modeling/Tracking


I’ve been working on developing a method for automatic head-pose tracking, and along the way have come to model facial appearances. I start by initializing a facial bounding box using the Viola-Jones detector, a well-known and robust detector for trained object classes such as faces. This lets me localize the face. Once I know where the 2D plane of the face is in an image, I can register an Active Shape Model like so:

After capturing multiple views of the possible appearance variations of my face, including slight rotations, I construct an appearance model.
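
As a rough sketch of how such a model can be built with OpenCV’s PCA (assuming the registered faces have already been cropped, resized to a common size, and flattened one per row into a float matrix; this is an illustration rather than the exact pipeline):

#include <opencv2/core.hpp>

// data: one registered face image per row (grayscale, flattened to floats).
// Returns the mean face plus the top principal components (basis vectors)
// of appearance variation.
cv::PCA buildAppearanceModel(const cv::Mat& data, int numComponents) {
    return cv::PCA(data, cv::Mat(), cv::PCA::DATA_AS_ROW, numComponents);
}

// Project a new registered face onto the basis; the first few coefficients
// are the ones examined below for their relation to pose.
cv::Mat appearanceCoefficients(const cv::PCA& pca, const cv::Mat& faceRow) {
    return pca.project(faceRow);
}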

The idea I am working with is to use the first components of variation of this appearance model to determine pose. Here I show the first two basis vectors and the images they reconstruct:

As you may notice, these two basis vectors very neatly encode rotation. By looking at the model’s coefficients along these basis vectors, you can also interpret pose.… Continue reading...

Tim J Smith guest blogs for David Bordwell

Tim J Smith, an expert in scene perception and film cognition and a member of The DIEM Project [1], recently featured as a guest blogger for David Bordwell, a leading film theorist with an impressive list of books and publications widely used in film cognition and film art research [2]. In his article featured on David’s site, Tim expands on his research on film cognition, including continuity editing [3], attentional synchrony [4], and the project we worked on from 2008-2010 as part of The DIEM Project. Since Tim’s feature on David Bordwell’s blog, The DIEM Project has seen a surge of publicity, with our Vimeo video loads climbing higher than 200,000 in a single day and features on dvice, slashfilm, gizmodo, Roger Ebert’s Facebook/Twitter, and the front page of imdb.com.

Not to mention, our tools and visualizations are finally reaching an audience with interests in film, photography, and cognition. If you haven’t yet seen some of our videos, please head on over to our Vimeo page, where you can see a range of videos embedded with eye-tracking of participants and many different visualizations of models of eye-movements using machine learning, or start by reading Tim’s post on Continue reading...

Responsive Ecologies Documentation

As part of a system of numerous dynamic connections and networks, we are reactive to, and determined by, a complex system of cause and effect. The consequence of our actions upon ourselves, the society we live in, and the broader natural world is conditioned by how we perceive our involvement. The awareness of how we have impacted a situation is often realised and processed subconsciously; the extent and scope of these actions can be far beyond our knowledge, our consideration, and importantly beyond our sensory reception. With this in mind, how can we associate our actions, many of which may be overlooked as customary, with, for instance, honey bee depopulation syndrome or the declining numbers of Siberian tigers?

Responsive Ecologies is part of an ongoing collaboration with ZSL London Zoo and Musion Academy. Collectively we have been exploring innovative means of public engagement to generate an awareness and understanding of nature and the effects of climate change. All of the contained footage has come from filming sessions within the Zoological Society; this has coincidentally raised some interesting questions on the spectacle of captivity, an issue which we have tried to reflect upon in the construction and presentation of … Continue reading...

Streaming Motion Capture Data from the Kinect using OSC on Mac OSX

This guide will help get you running PrimeSense NITE’s skeleton tracking inside Xcode on OS X. It will also help you stream that data in case you’d like to use it in another environment such as Max. An example Max patch is also available.

PrimeSense NITE Skeletonization and Motion Capture to Max/MSP via OSC from pkmital on Vimeo.
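
Before the setup steps, here is a minimal sketch of the kind of OSC message that gets sent for each tracked joint, written with ofxOsc for illustration; the address pattern and port are assumptions, not necessarily what the example Max patch expects:

#include "ofxOsc.h"
#include <string>

// Send one tracked joint's 3D position to Max/MSP over OSC.
void sendJoint(ofxOscSender& sender, int userId, const std::string& jointName,
               float x, float y, float z) {
    ofxOscMessage m;
    m.setAddress("/skeleton/" + std::to_string(userId) + "/" + jointName);
    m.addFloatArg(x);
    m.addFloatArg(y);
    m.addFloatArg(z);
    sender.sendMessage(m);
}

// Set up the sender once, e.g. in ofApp::setup():
//   ofxOscSender sender;
//   sender.setup("localhost", 12345);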

Prerequisites:

0.) 1 Microsoft Kinect or other PrimeSense device.

1.) Install Xcode and the Java Developer Package located here: https://connect.apple.com/cgi-bin/WebObjects/MemberSite.woa/wa/getSoftware?bundleID=20719 – if you require a Mac OS X developer account, just register at developer.apple.com since it is free.

2.) Install Macports: http://www.macports.org/

3.) Install libtool and libusb > 1.0.8:

$ sudo port install libusb-devel +universal

4.) Get the OpenNI Binaries for Mac OSX: http://www.openni.org/downloadfiles

5.) Install OpenNI by unzipping the file OpenNI-Bin-MacOSX (v1.0.0.25 at the time of writing) and running:

$ sudo ./install.sh

6.) Get SensorKinect from avin2: https://github.com/avin2/SensorKinect/tree/unstable/Bin

7.) Install SensorKinect by unzipping and running

$ sudo ./install.sh

8.) Install OpenNI Compliant Middleware NITE from Primesense for Mac OSX: http://www.openni.org/downloadfiles

9.) Install NITE by unzipping and running

$ sudo ./install.sh

When prompted for a key, enter the key listed on the openni website.

Getting it up and running:

1.) Download the … Continue reading...

Responsive Ecologies Exhibition

Come check out the Watermans Arts Centre from the 6th of December until the 21st of January for an immersive and interactive visual experience entitled “Responsive Ecologies”, developed in collaboration with the artists captincaptin. We will also be giving a talk on the 10th of December from 7 p.m. to 9 p.m. during CINE: 3D Imaging in Art at Watermans.

Responsive Ecologies is part of a wider ongoing collaboration between the artists captincaptin, ZSL London Zoo, and Musion Academy. Collectively they have been exploring innovative means of public engagement to generate an awareness and understanding of nature and the effects of climate change. All of the contained footage has come from filming sessions within the Zoological Society; this has coincidentally raised some interesting questions on the spectacle of captivity, an issue which we have tried to reflect upon in the construction and presentation of this installation. The nature of interaction within Responsive Ecologies means that a visitor to the space cannot simply view the installation but must become a part of its environment. When attempting to perceive the content within the space, the visitor reshapes the installation. Everybody has a degree of impact, whether directed or incidental, and … Continue reading...

6DOF Head Tracking

The following demo works with SeeingMachines FaceAPI in openFrameworks controlling a Mario avatar.  It also has some really poor gesture recognition (and learning, though that’s not shown here); a threshold on the rotation DOFs would have produced better results for the simple task of detecting look up/down and left/right gestures.
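
For instance, a minimal sketch of that kind of thresholding, given the head’s yaw and pitch in degrees from the tracker (the threshold value is made up for illustration and would need tuning, and the sign conventions depend on the tracker):

#include <string>

// Classify a coarse look direction from head rotation in degrees.
std::string classifyLook(float yawDeg, float pitchDeg, float threshold = 15.0f) {
    if (pitchDeg >  threshold) return "up";
    if (pitchDeg < -threshold) return "down";
    if (yawDeg   >  threshold) return "right";
    if (yawDeg   < -threshold) return "left";
    return "center";
}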

6DOF Head Tracking from pkmital on Vimeo.

interfacing seeingmachines faceapi with openFrameworks to control a 3D mario avatar

This is just with the non-commercial license. The full commercial license (~$3000?) gives you access to lip/mouth and eyebrow tracking, as well as much more flexibility in how you use their API with different/multiple cameras and in accessing image data.

Of course, there are other initiatives producing similar results. Mutual-information-based template trackers, for instance, seem to be state of the art. Take a look at recent work by Panin and Knoll using OpenTL:

 

I imagine a lot of people would like this technology.… Continue reading...

Keyframe based modeling

Playing with MSERs while trying to implement an algorithm for feature-based object tracking.  The algorithm first finds MSERs, warps them to circles, describes them with a SIFT descriptor, and then indexes keyframes of SIFT vectors using vocabulary trees.   Of course that’s a ridiculously simplified explanation, but look at what it’s capable of!
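
A rough sketch of the first two steps with OpenCV (MSER detection plus SIFT description); the warp-to-circles and vocabulary-tree indexing are omitted, and this is an illustration rather than the original implementation:

#include <algorithm>
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <vector>

// Detect MSER regions in a grayscale image and describe each region's
// centre with a SIFT descriptor (one descriptor row per region).
cv::Mat describeRegions(const cv::Mat& gray) {
    std::vector<std::vector<cv::Point>> regions;
    std::vector<cv::Rect> boxes;
    cv::Ptr<cv::MSER> mser = cv::MSER::create();
    mser->detectRegions(gray, regions, boxes);

    // Turn each region's bounding box into a keypoint sized to the region.
    std::vector<cv::KeyPoint> keypoints;
    for (const cv::Rect& r : boxes) {
        const cv::Point2f centre(r.x + r.width * 0.5f, r.y + r.height * 0.5f);
        keypoints.emplace_back(centre, static_cast<float>(std::max(r.width, r.height)));
    }

    cv::Mat descriptors;
    cv::SIFT::create()->compute(gray, keypoints, descriptors);
    return descriptors;
}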

 Continue reading...

Microsoft Kinect

This is big.  In less than a week, the Kinect has been hacked and ported to Windows, OS X, Linux, Java and Processing, Max/MSP (almost), and Flash…

Much much more to come: Continue reading...

“Memory” Video @ AVAF 2010

Please rate, share, and comment!

Memory @ AVAF 2010 from pkmital on Vimeo.

‘Memory’ is an augmented installation of a neural network by Parag K Mital & Agelos Papadakis.
hand-blown glass, galvanized metal chain, projection, cameras; 1.5m x 2.5m x 3m

Ghostly images of faces appear as recorded movie clips within neuron-shaped hand-blown glass pieces. As one begins to look at the neurons, one notices the faces as one’s own, trapped as disparate memories of a neural network.

Filmed and installed for the Athens Video Art Festival in May 2010 at Technopolis, Athens, Greece. The venue is a disused gas factory converted into an art space.

Also seen at Kinetica Art Fair, Ambika P3, London, UK, 2010; Passing Through Exhibition, James Taylor Gallery, London, UK, 2009; Interact, Lauriston Castle, Edinburgh, UK, 2009.

Continue reading...

Facebook Graph API

If you are one of the 500+ million users of Facebook, and you know your user id, try plugging it in here: http://zesty.ca/facebook/

This uses the Facebook Graph API to get information about Facebook users in a very accessible manner.  Of course, it is only your “public” information that is accessible without authorization.  But once you “allow” an application to access your information, you’re allowing access to EVERYTHING.

Generally, these items are publicly known:

{
   "id": "0123456789",
   "name": "Parag K Mital",
   "first_name": "Parag",
   "middle_name": "K",
   "last_name": "Mital",
   "locale": "en_US"
}

and also your Profile picture.

Check out a montage of the first 3600 Facebook users’ profile pictures, obtained just by using the public url: http://graph.facebook.com/USER_ID/picture


And an image of the average of all 3600 profile images:
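
Computing that average is simple; a minimal sketch with OpenCV (assuming the pictures have already been downloaded and resized to a common size, which isn’t shown):

#include <opencv2/core.hpp>
#include <opencv2/imgcodecs.hpp>
#include <string>
#include <vector>

// Average a set of same-sized colour images into a single image.
cv::Mat averageImages(const std::vector<std::string>& paths) {
    cv::Mat sum;
    int count = 0;
    for (const std::string& path : paths) {
        cv::Mat img = cv::imread(path);
        if (img.empty()) continue;
        cv::Mat imgF;
        img.convertTo(imgF, CV_32FC3);
        if (sum.empty()) sum = cv::Mat::zeros(imgF.size(), CV_32FC3);
        sum += imgF;
        ++count;
    }
    cv::Mat avg;
    if (count > 0) {
        cv::Mat mean = sum / static_cast<double>(count);
        mean.convertTo(avg, CV_8UC3);
    }
    return avg;
}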

Continue reading...

Dynamic Scene Perception Eye-Movement Data Videos and Analysis

Over the past 2 years, I have been working under the direction of Prof. John M Henderson, together with Dr. Tim J Smith and Dr. Robin Hill, on the DIEM project (Dynamic Images and Eye-Movements). Our project has focused on investigating active visual cognition by eye-tracking numerous participants watching a wide variety of short videos.

We are in the process of making all of our data freely available for research use, and we have also been working on tools for analyzing eye-movements during such dynamic scenes.
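
To give one concrete example of such a tool, a gaze heat map can be made by splatting a Gaussian at every fixation point in a frame; a minimal sketch of the idea (not CARPE’s actual code), assuming fixation points are already in pixel coordinates:

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Build a heat map for one frame by dropping an impulse at each fixation
// point, blurring the impulses into Gaussian blobs, and scaling to [0, 255].
cv::Mat gazeHeatmap(const std::vector<cv::Point2f>& fixations,
                    cv::Size frameSize, double sigma = 30.0) {
    cv::Mat heat = cv::Mat::zeros(frameSize, CV_32F);
    for (const cv::Point2f& p : fixations) {
        const int x = cvRound(p.x), y = cvRound(p.y);
        if (x >= 0 && x < frameSize.width && y >= 0 && y < frameSize.height)
            heat.at<float>(y, x) += 1.0f;
    }
    cv::GaussianBlur(heat, heat, cv::Size(0, 0), sigma);
    cv::normalize(heat, heat, 0, 255, cv::NORM_MINMAX);
    cv::Mat heat8u;
    heat.convertTo(heat8u, CV_8U);
    return heat8u;
}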

CARPE, more bombastically known as Computational Algorithmic Representation and Processing of Eye-movements, allows one to begin visualizing eye-movement data together with the video it was tracked on in a number of ways. It currently supports low-level feature visualizations, clustering of eye-movements, model selection, heat-map visualizations, blending, contour visualizations, peek-through visualizations, movie output, binocular data input, and more. The videos shown above on our Vimeo page were all created using this tool. Head over to Google Code to check out the source code or download the binary. We are still streamlining this by creating manuals for new users and uploading more of the eye-tracking and video data so … Continue reading...

Augmented Sculpture Project

This will be my second year supervising the Digital Media Studio Project at the University of Edinburgh. The course is a mix of over 60 Digital Composition, Sound Design, Digital Design in Media, and Acoustic and Music Technology MSc students. 10-15 supervisors pitch a project proposal and the students decide which ones they’d like to participate in. This year, I proposed Augmented Sculpture, and 3 students signed up, of which two are Sound Designers and one is a Digital Designer. So far, they have managed to communicate tracking data via the reacTIVision framework and combined a life-sized sculpture with an interactive sonic environment built in Max/MSP.

Chandan, Helen and Ev playing with a reacTIVision-controlled Max/MSP patch developed for the Digital Media Studio Project at Edinburgh University. This is the very first ever test run of the system, and it worked!

Follow more developments on their blog.… Continue reading...

