Visualizing Repetition in Lyrics

Today I stumbled upon Visualizing Repetition in T.S. Eliot’s Four Quartets. Will Kurt devised a simple and brilliant method to visualize repetition structure in poems and lyrics. And that got me wondering: how would the lyrics of Hindi songs look in this visualization?

They have so much repetition, and so many different repeating patterns, that I was pretty sure I would find something interesting. So I coded a similar, simpler version (visualize-lyrics).

What will the visualization look like?
You’ll be seeing a grid for each song. Each square in the grid represents the similarity between two lines of the song. The bluer the square is, the more similar the two lines are. Maroon means the lines are dissimilar. A color legend is attached to each image for clarification.

What patterns can you look for?
Repeating patterns in blue correspond to repeating patterns in the song. More maroon means the lyricist used lots of different words and did not adhere to a lyrical structure. Remember that the grid will always be symmetric about the diagonal, so don’t count that symmetry as a pattern.
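For the curious, here is a minimal sketch of how such a grid could be computed. It assumes each line is turned into a bag of whitespace-separated words and that lines are compared with cosine similarity; the function names are made up for illustration, and the actual visualize-lyrics code may well differ.

```python
import math
from collections import Counter


def line_vector(line):
    """Word-count vector for one lyric line (lowercased, split on whitespace)."""
    return Counter(line.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity of two word-count vectors; 0.0 if either is empty."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_grid(lines):
    """Square grid where entry [i][j] compares line i with line j."""
    vectors = [line_vector(line) for line in lines]
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


lyrics = [
    "rim jhim gire saawan",
    "sulag sulag jaaye mann",
    "rim jhim gire saawan",
]
grid = similarity_grid(lyrics)
```

Each value in `grid` lies between 0 and 1 and would be mapped to a color, with identical lines at the blue end and lines sharing no words at the maroon end. Since comparing line i with line j gives the same number as comparing j with i, the symmetry about the diagonal falls out for free.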

Let’s get started.

1. Rim Jhim Gire Saawan

A short and melancholic song. It returned the following:

The lines Rim jhim gire saawan, Sulag sulag jaaye mann show up as the repeating blues. I daresay the grid resembles falling raindrops, which is also the visual imagery created by the lyrics. Most probably just a coincidence.

2. Ek Ladki Bheegi Bhaagi Si

Here’s another song set against the backdrop of rain. And its visualization:

Again falling raindrops! Am I onto something here? On a more serious note (Inserted music pun. Score!), this is a longer song and the lyrical structures are very evident. Heck, the structures are almost symmetric about themselves too. That’s sheer brilliance.

3. Ek Ladki Ko Dekha Toh

I chose this song on purpose. It has too many repetitions.

So many that maroon can’t really be called the background color. And the number of big squares formed shows how this song’s music is also structured.

4. Emotional Atyachar

I included this because the repeating structure in this song develops around the middle and by the end we are only left with Tauba tera jalwa… emotional atyachar. So the visualization gives us this:

5. Husna

This is a very powerful song whose lyrics don’t give in to direct repetition to get lyrical structure.

The lack of blue squares and the increased number of yellow squares show how the lyricist follows a certain structure but does not repeat lines altogether.

Speaking of repeating lines altogether, one cannot ignore the contribution of a certain Mr. Himesh Reshammiya and his lyricists. As an apt ending to anything about repeating lyrics, I attach this masterpiece.

And its cosine-similarity graph.

The repeating pattern is so damn beautiful. Lots of descending electric rays, seeking the deepest crevices on earth, in the hope that the ocean will muffle the song that created them.


Discovering structure in a video using bag-of-words

I worked on a project last year that used space-time interest points and HOG/HOF features to classify human-human interaction in videos using the bag-of-words approach. I learnt a lot about how to use feature matching in classification. Although a decent overall accuracy was achieved, the project left me asking this question: can we ever truly discover the structure present in a video using the bag-of-words method?

As soon as we cluster the HOF features we eliminate information in the time dimension: it no longer matters whether a feature occurred in an earlier frame or a later one. It’s not totally a sham. The features still manage to capture local information (this frame and the next in time, neighbouring pixels in space), but is this enough to get the structure of the entire action performed? Clustering HOG/HOF features effectively gets rid of any temporal or spatial structure beyond that local neighbourhood.
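A toy illustration of the point, assuming every frame of a clip has already been reduced to a single visual-word id (the ids and the two "actions" below are invented for the example): a clip and its time-reversed version produce exactly the same bag of words.

```python
from collections import Counter


def bag_of_words(word_ids):
    """Histogram of visual-word ids; the order of occurrence is discarded."""
    return Counter(word_ids)


# Hypothetical visual-word ids, one per frame, for a "sitting down" clip.
sit_down = [0, 0, 1, 2, 3, 3]
# Playing the clip backwards looks like "standing up".
stand_up = list(reversed(sit_down))

same_histogram = bag_of_words(sit_down) == bag_of_words(stand_up)
different_sequences = sit_down != stand_up
```

Both `same_histogram` and `different_sequences` come out true here, so a classifier that only sees the histogram has no way to tell these two clips apart.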

So why does the method still give good results? Mostly because there are enough features to classify or differentiate between actions. But if we are to store representations of actions, I don’t think bag-of-words is the way to go. It can be effective in a problem setting where differentiating between actions is important, but not when you need to discover structure in a video that isn’t local.

Natural Language Processing in CV

Most problems in computer vision involve a step of dimensionality reduction, where you go down from a high dimensional feature space to a low dimensional feature space. Two approaches from NLP have found great application in CV:

1) Bag of Words
2) Probabilistic latent semantic analysis

While bag of words is typically used in a supervised learning setting, probabilistic latent semantic analysis is its counterpart for unsupervised learning.

Bag of Words

Bag of words has been used effectively in object and action recognition. Unlike in language, you don’t have an existing dictionary of words that can be used directly in the bag of words approach. So usually features like SIFT, HOG, HOF (Histogram of Optical Flow), SURF etc. are clustered to get a dictionary. This dictionary of “visual words” can then be used in the bag of words model.

There are some subtle details that are left to experimentation, like the number of “visual words” in the dictionary, the distance metric to be used while clustering etc. Euclidean distance, though commonly used, is not a great metric for CV. For example, if you are clustering regions of similar colour, violet is closer to red than to green. This can be dealt with by changing into another colour space which models human vision better. This provides a good comparison of many commonly used colour spaces.

Once clustering is done, one can train an SVM on the occurrence of these “visual words” in other images/videos to get various classifiers that can be used for object and action recognition.
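The dictionary-building and counting steps can be sketched roughly as below. This is a toy version under stated assumptions: descriptors are plain 2D tuples rather than real SIFT/HOG vectors, clustering is a bare-bones k-means with Euclidean distance, and all function names are made up for the example. In practice one would reach for a library implementation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centres, point):
    """Index of the centre (visual word) closest to the point."""
    return min(range(len(centres)), key=lambda k: squared_distance(centres[k], point))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means; returns k cluster centres, i.e. the visual dictionary."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(centres, p)].append(p)
        for idx, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centre
                centres[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return centres


def histogram(centres, descriptors):
    """Count how often each visual word occurs among an image's descriptors."""
    counts = [0] * len(centres)
    for d in descriptors:
        counts[nearest(centres, d)] += 1
    return counts


# Toy 2D "descriptors" forming two obvious groups.
training_descriptors = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
    (10.0, 10.1), (9.9, 10.0), (10.2, 9.8),
]
dictionary = kmeans(training_descriptors, k=2)

# Descriptors from a new image: two from the first group, one from the second.
image_descriptors = [(0.1, 0.1), (0.0, 0.0), (10.0, 10.0)]
image_histogram = histogram(dictionary, image_descriptors)
```

The resulting `image_histogram` is the fixed-length vector that would be handed to the SVM, one such vector per training image or video.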

Probabilistic Latent Semantic Analysis

pLSA is a statistical technique for the analysis of two-mode and co-occurrence data. In effect, one can derive a low dimensional representation of the observed variables in terms of their affinity to certain hidden variables. [Wikipedia]

Instead of documents the algorithm operates on images, the words are substituted by visual words, and the topic is represented by a category of an object. The initial steps of getting “visual words” are similar to those described in the bag of words model.

The training process of pLSA yields the probabilities P(z|i) and P(w|z), where z is the object category, i is the image and w is a visual word. P(z|i) is then used for each image i: the image is classified as containing object k if P(z=k|i) is the maximum of P(z|i) over all object categories present.
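A rough sketch of the training and classification step, written as the standard EM updates for pLSA on a small image-by-visual-word count matrix. The function names, the toy counts and the number of iterations are assumptions for illustration. Note that the learned topics are unlabelled, so which topic index corresponds to which object category still has to be worked out separately.

```python
import random


def normalise(values):
    """Scale non-negative values to sum to 1 (uniform if they sum to 0)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsa(counts, n_topics, iterations=50, seed=0):
    """EM for pLSA. counts[i][w] = occurrences of visual word w in image i.

    Returns (p_z_i, p_w_z): P(z|i) per image and P(w|z) per topic.
    """
    rng = random.Random(seed)
    n_images, n_words = len(counts), len(counts[0])
    p_z_i = [normalise([rng.random() + 1e-3 for _ in range(n_topics)])
             for _ in range(n_images)]
    p_w_z = [normalise([rng.random() + 1e-3 for _ in range(n_words)])
             for _ in range(n_topics)]
    for _ in range(iterations):
        new_z_i = [[0.0] * n_topics for _ in range(n_images)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for i in range(n_images):
            for w in range(n_words):
                if counts[i][w] == 0:
                    continue
                # E-step: posterior P(z|i,w) is proportional to P(z|i) * P(w|z).
                post = normalise([p_z_i[i][z] * p_w_z[z][w] for z in range(n_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(n_topics):
                    new_z_i[i][z] += counts[i][w] * post[z]
                    new_w_z[z][w] += counts[i][w] * post[z]
        # M-step: renormalise the expected counts into probabilities.
        p_z_i = [normalise(row) for row in new_z_i]
        p_w_z = [normalise(row) for row in new_w_z]
    return p_z_i, p_w_z


# Toy counts: images 0-1 mostly use words 0-1, images 2-3 mostly use words 2-3.
counts = [
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 5, 4],
    [0, 0, 4, 5],
]
p_z_i, p_w_z = plsa(counts, n_topics=2)

# Classify each image by the topic with the largest P(z|i).
predicted = [max(range(2), key=lambda z: row[z]) for row in p_z_i]
```

The last line is the classification rule from the paragraph above: pick the z that maximises P(z|i) for each image.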

Glimpses of CV Awesomeness 2

It would be really cool if you could animate the objects you see in your room by “being” the object itself.

So what would you need? No motion capture. No model rigging. A Kinect will suffice.

Microsoft came up with a system, KinÊtre, that lets you do exactly that.

The problem is that of finding a correspondence between the actor and the character graphs, then keeping track of the deformation produced by the motion of the actor and propagating it back to the character. It is a little more involved than that because the character to be animated is a mesh.

This work is analogous to the work done at University of Washington on the project Being John Malkovich. That project enabled a user to control another person’s face in real-time.


Action Recognition and Object Recognition from RGBD Images

You have an RGBD image, which is your typical output from the Kinect or the Xtion Pro. One method that can be used to recognize objects is to segment out objects from the given image, recreate an approximate 3D model of each object and match it against some pre-existing models.

An interesting parallel can be drawn between recognizing objects in 3D and recognizing actions from videos.

In both cases we deal with 3 dimensions. In object recognition we deal with x, y (image plane coordinates) and z (depth), and in action recognition we deal with x, y (image plane coordinates) and t (time). So why not model the action recognition problem simply as an object recognition problem?

Every action is a 3 dimensional shape in the x, y, t coordinate space, just as every object is a 3 dimensional shape in the x, y, z space. Hence, a very naive approach would be to make both these problems mathematically equivalent. Firstly, you need to have “learned models” of the actions to be recognized. These learned models can be the contours of actions in the space-time coordinates. Then the task would be to segment out actions from videos and look for the best match in the list. Simple enough, right?

However, by making these two problems equivalent we are losing out on lots of information. In an RGBD image there is occlusion, so we are forced to build a hollow 3D model forming the contour of the object. But in a video sequence there is no such problem. We need not restrict ourselves only to contours while recognizing actions.
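The naive approach can be sketched as follows, assuming the action has already been segmented into one binary silhouette mask per frame. The voxel-set representation, the overlap score and the function names are choices made for this example, not a reference method.

```python
def space_time_volume(silhouettes):
    """Stack per-frame binary masks into a set of occupied (x, y, t) voxels."""
    return {
        (x, y, t)
        for t, mask in enumerate(silhouettes)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    }


def overlap(a, b):
    """Intersection-over-union of two voxel sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(volume, models):
    """Name of the learned model whose space-time shape overlaps most."""
    return max(models, key=lambda name: overlap(volume, models[name]))


# Toy clips: a one-pixel blob on a 1x3 image, three frames each.
move_right = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]]
move_left = [[[0, 0, 1]], [[0, 1, 0]], [[1, 0, 0]]]

models = {
    "move_right": space_time_volume(move_right),
    "move_left": space_time_volume(move_left),
}

query = space_time_volume(move_right)
label = best_match(query, models)
score_against_left = overlap(query, models["move_left"])
```

Here `label` is `"move_right"`, while the overlap with the reversed clip is only 0.2, because the two shapes share a single voxel in the middle frame. Treating time as a real axis keeps the ordering of the motion, at the price of needing aligned, segmented clips.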

More information regarding action recognition can be found here.

Glimpses of CV Awesomeness 1

Say you have an app. You draw this on it.

(Image: input frame of a horse)

And tell it that you want a horse. And it gives you back this.

Next you draw just the frame of a galloping horse and now you generate the shape of a galloping horse.

Other complete shapes (motorbikes etc.) can also be generated from incomplete shapes.

So how is this done? 

The system uses a Shape Boltzmann Machine, which is a form of Deep Boltzmann Machine.

And how is this shape/model generation useful?

A good model of object shape is essential in applications such as segmentation, object detection, inpainting and graphics.

More details of this research can be found here.


Computer Vision Resources

Here’s a list of books, libraries, websites, blogs etc that might be useful.

Books on Computer Vision

  1. Programming Computer Vision with Python – One can download the draft and start out with image processing and vision. Theoretically, it touches on a variety of topics from camera models to image segmentation, local image descriptors to multi-view geometry among others, and you get to see your code up and running in no time. (Hats off to Python!)
  2. Computer Vision: Algorithms and Applications – Downloadable draft! Good for developing a strong theoretical and mathematical background on a lot of present day research topics. Thumbs up for the number of results/images that are attached for the algorithms described.
  3. Pattern Recognition and Machine Learning – Ok, this is not a book on vision but is almost essential for anyone who wishes to do machine learning voodoo in vision.
  4. Computer Vision: A Modern Approach – Classic literature on vision.

Books on Libraries

  1. Learning OpenCV: Computer Vision with the OpenCV Library – Good if you are going to be using OpenCV for your project. However, I would suggest learning OpenCV from its tutorials and by actually writing code from scratch.
  2. Making Things See: 3D vision with Kinect, Processing, Arduino, and MakerBot – The title says it all. Has lots of interesting projects that use vision in real-time.



1. – Best set of tutorials for OpenCV.

2. Computer Vision Central Lectures/Tutorials


  1. – Lots of resources, news etc. on CV


  1. Coursera’s CV Course
  2. Computer Vision Central Course List

Computer Vision Papers for motivation (or demotivation in case you find someone publishing a paper on something close to your work before you have even collected data)


If you are into research on vision, sooner or later you will be needing standard data-sets to test your algorithms etc.

  1. – The good thing about this page is that they have classified the data-sets into categories.
  2. Computer Vision Central Datasets

Feel free to suggest additions to the above.

Another Beginning

The most pathetic person in the world is someone who has sight but no vision.

A computer is just as pathetic. Just like everything else, it needs to be taught how to see. Hence, the problem of vision and the area of computer vision.

It is often stated in any introductory course on computer vision that the vision problem is:

  • deceivingly easy
  • deceptive
  • computationally demanding
  • critical to many applications

In other words, it is damn interesting.

I will be using this blog as both a log of my own experiments with computer vision, image processing and machine learning and as a collection of all things interesting.