Wednesday, 10 May 2017

Object Detector using SVM and HOG

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data, the algorithm outputs an optimal hyperplane which categorizes new examples.

While there exist many hyperplanes separating the given data, the SVM outputs the optimal hyperplane dividing the two classes. The optimal hyperplane is the one with the largest minimum distance to the training examples.

Given a hyperplane, how do we build a decision rule that uses it as a decision boundary?

Let's consider a vector W which is perpendicular to the hyperplane, starting from the origin. Let U denote the vector joining the origin to the unknown point. Consider the projection of this unknown vector U onto W, given by U.W.
If this projection is small, the unknown point lies in one class; otherwise it lies in the other class.
i.e. if U.W + B > 0  =>  it is a positive sample
     if U.W + B < 0  =>  it is a negative sample.
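As a quick sanity check of the decision rule (the values of W, B and the test points below are made up purely for illustration):

import numpy as np

W = np.array([1.0, 2.0])   # normal vector to the hyperplane (illustrative)
B = -3.0                   # bias term (illustrative)

def classify(U):
    # decision rule: the sign of U.W + B decides the class
    return +1 if np.dot(U, W) + B > 0 else -1

print(classify(np.array([2.0, 2.0])))   # 1*2 + 2*2 - 3 =  3.0 > 0  ->  +1
print(classify(np.array([0.5, 0.5])))   # 0.5 + 1 - 3   = -1.5 < 0  ->  -1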

But this alone does not give us enough information to find W and B.
Let's think of a way to do that.
Let X+ denote a positive sample and X- denote a negative sample.

Let's assume that W.X+ + B >= 1
and also that W.X- + B <= -1

Let us denote Yi = +1 for positive samples and Yi = -1 for negative samples.
Multiplying the two inequalities above by Yi gives
Yi(W.Xi + B) >= 1
for both cases; the two inequalities collapse into one because of the extra variable Yi.

Yi(W.Xi + B) - 1 >= 0

Yi(W.Xi + B) - 1 = 0 for the points nearest to the decision plane.   ----> (1)

Consider the two data points X- and X+ lying closest to the decision boundary.
The width of the margin is given by (X+ - X-).W / ||W||

From Equation (1), W.X+ + B = 1 and W.X- + B = -1, so W.(X+ - X-) = 2 and the width simplifies to 2/||W||.

Our aim is to maximize this width, and hence minimize ||W||, subject to the above constraints.
This is equivalent to minimizing 0.5*||W||^2.

Solving with the help of Lagrange multipliers gives
W = Sum(alpha_i * Yi * Xi)   ----> (2)
Sum(alpha_i * Yi) = 0        ----> (3)
These equations can be passed into an optimization framework to obtain the optimal values of W and B.
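In practice we rarely solve this optimization by hand; libraries such as scikit-learn do it for us. A minimal sketch (the toy 2D data below is made up just to illustrate recovering W and B from a linear SVM):

import numpy as np
from sklearn.svm import SVC

# toy 2D data: two linearly separable blobs (illustrative only)
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1e3)   # large C approximates a hard margin
clf.fit(X, y)

W = clf.coef_[0]        # the normal vector W from equation (2)
B = clf.intercept_[0]   # the bias B
print(W, B)

# decision rule: the sign of W.U + B classifies a new point U
U = np.array([4.0, 4.0])
print(np.sign(np.dot(W, U) + B))    # expected: +1 for this point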

Assuming that we have some data which is not linearly separable, we need a transformation that makes the data linearly separable. There are various kernels that can be used to perform such a transformation.
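With scikit-learn, switching to such a kernel is a one-line change (the kernel and parameters below are just common defaults, not a recommendation):

from sklearn.svm import SVC

# an RBF kernel implicitly maps the data into a higher-dimensional space
# where it may become linearly separable
clf = SVC(kernel='rbf', gamma='scale', C=1.0)
# clf.fit(X, y) is then used exactly as in the linear case above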

Now let's talk about something different: a feature descriptor called the Histogram of Oriented Gradients (HOG). HOG feature descriptors are very frequently used in object detection, and today we will be using HOG along with an SVM for object detection.
First of all, let's begin by asking the question: what is a feature descriptor?
A feature descriptor is a representation of an image that keeps only the information useful for a particular task. Note that it is not a compression technique, as we will not be able to recreate the image from its descriptors.

Procedure to calculate the HOG of an Image
1. Calculate the gradient image using the Sobel X and Sobel Y operators.
2. Divide the image into small cells (e.g. 8x8 pixels) and calculate a histogram of gradients in each cell, binned by gradient direction and weighted by gradient magnitude.
3. Normalize (average) the histograms over adjacent neighbouring cells in order to capture local shape more robustly.
4. Concatenate all the resulting histograms to obtain a single one-dimensional vector containing the HOG feature.
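A minimal sketch of these steps using OpenCV's built-in HOGDescriptor (the image path is a placeholder; the default descriptor uses a 64x128 window, 8x8 cells, 16x16 blocks and 9 orientation bins):

import cv2

img = cv2.imread('patch.png', cv2.IMREAD_GRAYSCALE)   # placeholder path
img = cv2.resize(img, (64, 128))                      # match the descriptor window

hog = cv2.HOGDescriptor()
feature = hog.compute(img)       # one long 1-D feature vector
print(feature.shape)             # 3780 values with the default parameters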

The resulting feature vectors of a set of labeled images can then be used to train an SVM, yielding a classifier that distinguishes between the various classes of images.





I have used HOG features along with an SVM to detect treble clefs in sheet music.
There were hardly any false positives, but some clefs were not detected. I hope to resolve this either by using more data or by building multiple classifiers.
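The detection step itself is a sliding window over the page: every window is described with HOG and scored by the trained SVM. A rough sketch (the window size, stride and the trained classifier clf are assumptions):

import cv2

win_w, win_h = 64, 128      # assumed window size matching the HOG descriptor
stride = 16                 # assumed step between windows
hog = cv2.HOGDescriptor()

def detect(image, clf):
    # slide a window over the page and score each patch with the trained SVM
    detections = []
    for y in range(0, image.shape[0] - win_h, stride):
        for x in range(0, image.shape[1] - win_w, stride):
            patch = image[y:y + win_h, x:x + win_w]
            feat = hog.compute(patch).reshape(1, -1)
            if clf.predict(feat)[0] == 1:          # positive = treble clef
                detections.append((x, y, win_w, win_h))
    return detections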

The image shows one of the best results obtained by the program. I hope to use this code to build a complete OMR system.




Tuesday, 9 May 2017

Hough Transform

Today I will be explaining a very popular algorithm in the domain of computer vision known as the Hough transform. I will start by explaining how the algorithm works and then show a sample implementation that uses it to detect straight lines in sheet music.
So let's get started.
What does Wikipedia tell you?
The Hough transform is a feature extraction technique used in image analysis, computer vision, and digital image processing. The purpose of the technique is to find imperfect instances of objects within a certain class of shapes by a voting procedure. This voting procedure is carried out in a parameter space, from which object candidates are obtained as local maxima in a so-called accumulator space that is explicitly constructed by the algorithm for computing the Hough transform.

What is meant by a parameter space?


Let me explain it by considering the case of a line.
In 2D the equation of a line can be written as
y = mx + c

Here the parameters of the line are m and c, as different values of these parameters give different lines in the XY plane.




Let's rewrite this equation as c = -mx + y.
Now we can visualize the same equation in the MC plane, with x and y playing the role of the parameters.

This MC plane is known as the parameter space in the Hough transform.
So what is the advantage of using such a space?
Each point in the XY plane can be mapped to a line in the parameter space, and each point in the parameter space can be mapped to a line in the XY plane. So consider two points in the XY plane and their corresponding lines in the parameter space. The point of intersection of those lines gives the M and C values for the equation of the line joining the two original points. This is the main idea behind the Hough transform. But this representation has a problem with vertical lines, whose slope is infinite. Hence, to obtain a generalized solution, we use another parametric representation of a line
i.e. r = X*cos(theta) + Y*sin(theta)
Now our task is to figure out whether there is a line in a given image. Assume that the image is binary, with pixels that are either zero or one. For each non-zero point in the image we can draw a curve (a sinusoid in the (r, theta) representation) in the parameter space.
We can represent the parameter space with a two-dimensional array and increment the cells through which each curve passes. If the points do in fact lie on a line, the cell corresponding to that line's parameters (r, theta) will accumulate a much larger value than the other cells. Hence we can read off the exact equation of the line.
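A bare-bones sketch of this voting procedure in Python (the accumulator resolution and the input binary image are assumptions made for illustration):

import numpy as np

def hough_lines(edges, n_theta=180, n_r=400):
    # edges: binary image; every non-zero pixel casts votes
    h, w = edges.shape
    r_max = np.hypot(h, w)
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    accumulator = np.zeros((n_r, n_theta), dtype=np.int32)

    ys, xs = np.nonzero(edges)
    for x, y in zip(xs, ys):
        # each point votes along its curve r = x*cos(theta) + y*sin(theta)
        rs = x * np.cos(thetas) + y * np.sin(thetas)
        r_idx = np.round((rs + r_max) / (2 * r_max) * (n_r - 1)).astype(int)
        accumulator[r_idx, np.arange(n_theta)] += 1

    # the strongest line corresponds to the largest cell in the accumulator
    r_i, t_i = np.unravel_index(np.argmax(accumulator), accumulator.shape)
    r = r_i / (n_r - 1) * 2 * r_max - r_max
    return r, thetas[t_i], accumulator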

The same technique can be extended to any curve that can be represented parametrically. Thus we can find whether some points do in fact lie on a particular curve.
Let's analyze the complexity of this method. Suppose we have N points. For each point we have to update a set of cells in the parameter space, and how finely we split the parameter space is under our control. Finally we have to find the maximum in a two-dimensional array to obtain the required parameters. Hence the overall complexity is O(N), although the constants can be quite large.





This is the result of applying the Hough line transform on sheet music. Even though multiple lines are detected for a single stave line, those duplicates can be suppressed with a little post-processing.
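For reference, the OpenCV call that produces this kind of result looks roughly like the following (the file name and thresholds are placeholders):

import cv2
import numpy as np

img = cv2.imread('sheet.png', cv2.IMREAD_GRAYSCALE)    # placeholder path
edges = cv2.Canny(img, 50, 150)

# rho resolution of 1 pixel, theta resolution of 1 degree, vote threshold of 200
lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)

for line in lines:
    r, theta = line[0]
    a, b = np.cos(theta), np.sin(theta)
    x0, y0 = a * r, b * r
    p1 = (int(x0 + 2000 * (-b)), int(y0 + 2000 * a))
    p2 = (int(x0 - 2000 * (-b)), int(y0 - 2000 * a))
    cv2.line(img, p1, p2, 0, 2)                        # draw each detected line

cv2.imwrite('sheet_lines.png', img)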








Thanks for reading, and have a great day!

Monday, 8 May 2017

Expectation Maximization Algorithm

The expectation maximization (EM) algorithm is an iterative method to find maximum likelihood estimates of parameters in statistical models. The EM algorithm is very similar to K-Means but uses probability values instead of a hard cutoff. Today I will show how to use the EM algorithm to fit a Mixture of Gaussians (MoG). Specifically, I will use the EM algorithm to model the histogram of an image as a MoG. For simplicity, assume that the image histogram can be approximated by two Gaussian distributions.


For instance, consider the following figure. The histogram of this image is plotted below.

The histogram shows a peak for the background as well as a peak for the skin color near 200. These two distributions can be approximated as Gaussians, and the EM algorithm can be used to calculate the means and variances of these distributions.

Once the means and variances are computed, those values can be used to threshold a particular color alone. Closed contours can then be used to extract just the face from the thresholded region. The EM implementation from OpenCV's ml module was used directly, and the resulting means were found at 0 and 136 with standard deviations of 0 and 83. This abnormal value of the standard deviation indicates that the distribution cannot be modeled properly by a Mixture of Gaussians with two modes.

So an attempt was made to model the distribution with three modes. The means were obtained at 200, 46 and 0, with standard deviations of 22.4, 0 and 46.55. This is a much better fit for the histogram, and hence three modes were used.
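A rough sketch of this fit using OpenCV's ml module (the image path is a placeholder, the pixel intensities are used directly as 1-D samples, and the exact accessor names may vary slightly between OpenCV versions):

import cv2
import numpy as np

img = cv2.imread('face.png', cv2.IMREAD_GRAYSCALE)    # placeholder path
samples = img.reshape(-1, 1).astype(np.float32)       # one intensity value per row

em = cv2.ml.EM_create()
em.setClustersNumber(3)                               # three Gaussian modes
em.trainEM(samples)

means = em.getMeans().ravel()                         # means of the three modes
variances = np.array(em.getCovs()).ravel()            # 1x1 covariances = variances
print(means, np.sqrt(variances))                      # means and standard deviations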

Now let's try thresholding the image with a threshold given by mean - S.D. = 200 - 22 = 178.
The white region corresponds to the skin color in the initial image.
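The thresholding step itself is a single OpenCV call (same placeholder image as above):

import cv2

img = cv2.imread('face.png', cv2.IMREAD_GRAYSCALE)          # placeholder path
_, mask = cv2.threshold(img, 178, 255, cv2.THRESH_BINARY)   # 178 = 200 - 22
cv2.imwrite('skin_mask.png', mask)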
 
Typically we would not use the EM algorithm for solving problems like this, since the number of points is quite large. Still, this example illustrates the power of the EM algorithm.





The associated code can be found here

Sunday, 7 May 2017

Calculating Extrinsic Camera Matrix

Most people doing research in computer vision will be aware of the camera matrices, i.e. the intrinsic and the extrinsic matrix. There is a huge amount of code available which calibrates the intrinsic matrix from chessboard images, so given a new camera it is quite easy to calibrate its intrinsics. The extrinsic matrix, on the other hand, varies with each individual image and contains the position and orientation of the camera.
Given this much information, I always found it difficult to work out how exactly to calculate the camera extrinsics when we know the exact orientation and pose of the camera in the real world. Today I found this article, which was extremely useful for understanding this matrix properly.

Let C be a column vector describing the location of the camera center in world coordinates. Let Rc be the rotation matrix denoting the camera's orientation with respect to the world coordinates.

The extrinsic rotation is then R = Rc^T (the transpose of Rc) and the extrinsic translation is t = -Rc^T * C, giving the extrinsic matrix [R | t]. Using these relationships we can compute the camera extrinsic matrix.
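A small numpy sketch of this computation (Rc and C below are made-up values, just to show the shapes involved):

import numpy as np

Rc = np.eye(3)                         # camera orientation in world coordinates (illustrative)
C = np.array([[1.0], [2.0], [0.5]])    # camera center in world coordinates (illustrative)

R = Rc.T                               # extrinsic rotation: world -> camera
t = -Rc.T @ C                          # extrinsic translation
extrinsic = np.hstack([R, t])          # the 3x4 extrinsic matrix [R | t]
print(extrinsic)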

Camera extrinsic matrices are required, for example, when creating multi-view stereo datasets.