Rotation Invariant Histogram Of Gradients Via DFT

I have successfully used HoG for a number of object recognition tasks. In fact I have found it to be a ridiculously good image descriptor particularly for rigid bodies. It can easily differentiate between various objects given very few training samples and is much faster than any of the existing object detection algorithms. While I know the current trend among researchers is toward CNNs, I also feel that CNNs are quite computationally expensive and require extensive training as opposed to HoGs which can be easily implemented on embedded devices.

One of the major problems with HoG though is that HoG is not rotation invariant. The naïve way of solving this problem is by shifting the histogram bins one by one then measuring the distance between bins. Naturally, this is a very expensive process on a large scale. The time complexity of this would be O(n^2). Naturally it would be more convenient to find something that performs the same task in a faster manner.

The task of shifting the histogram bin by bin smelled awfully like correlation – in fact it is correlation. One of the cool properties of correlation is that it has a nice mathematical property:

Let F(x) be the Fourier transform of x.

\star is the correlation operator.

a\star b = F^{-1}(F(x)^{*}F(x))

So lets try it.

I generated a random histogram:

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline 
arr = np.random.rand(20)


Next I wrote a shift function:

def shift_array(array,quantity):
     shifted_array = [0.0]*len(array)
     i = 0
     while i < len(array):
         shifted_array[i-quantity] = array[i]
     return shifted_array
arr_shifted = shift_array(arr,3)


Now lets try the trick:

arr1 = np.fft.rfft(arr)
arr2 = np.fft.rfft(arr_shifted)
output = np.fft.irfft(np.conj(arr1)*arr2)


Lets try on arr_shifted:

arr_shifted2 = shift_array(arr,9)
arr1 = np.fft.rfft(arr)
arr2 = np.fft.rfft(arr_shifted2)
output2 = np.fft.irfft(np.conj(arr1)*arr2)


Notice the peak shifted! This could well tell us how much shift there is. Lets check to see:


This gives us peaks at  9 and 3. Exactly the amount we shifted!

But how about comparison? While we can determine the angle of best fit, how do we know whether its actually a fit? Lets take a look:

arr_diff = np.random.rand(20)
arr1 = np.fft.rfft(arr)
arr2 = np.fft.rfft(arr_diff)
output2 = np.fft.irfft(np.conj(arr1)*arr2)


Notice how there is no clear peak. One could use simple mean and stdev to show that there is no strong peak. This would in effect give us an estimate for how strongly correlated the values are.

DIY 3D Scanner Via Inverse Square Law

DIY 3D Scanner Via Inverse Square Law

When you look around the market, there are many different types of 3D scanners and 3D sensors out there. A quick look at PCL’s IO library lists a number of different sensors out there.  Microsoft has their Kinect which use Time Of Flight to measure depth. The PR2 robot uses projected stereo mapping (i.e. It uses a projector to project a pattern onto the floor then it uses visual cues from the projected pattern to reconstruct the depth of the pattern). Then there are laser range finders which use rotating mirrors to scan depth. Some sensors use multiple camera like we use our two eyes to estimate depth. All of these methods have their own problems. Measuring time of flight is challenging and expensive, rotating mirrors with laser are slow and unsafe, stereoscopic methods are still quite inaccurate. Enter 3D Scanning Via Inverse Square Law. This is a new technique that I have hacked together using one LED and a camera. It is somewhat like projected stereo mapping but simpler and less computationally expensive.


So how does it work? In a traditional time of flight imaging system a pulse of light is sent out and the time it takes for the reflection to come back is used to gauge the distance. Light is very fast so the time difference for most objects is so minuscule that it makes the electronics very complicated. There is something else that changes with time over a distance – Power. The further we move away from a light source the less light we receive. This is characterised by the inverse square law.

 P_{detected} = \frac{P_{source}}{r^2}

Since power is very easy to measure, in theory it should be quite easy to measure the distance from the source using this technique. The challenge arises because the brightness of the source is not constant. Not all materials reflect light in the same way. If we were to imagine a set up like the one below:

Let us assume that the light source emits light at a power of L.

Then the object receives  \frac{L}{P^2}  amount of light.

If its albedo is a then the light reflected back is:


According to the inverse square law the light that reaches the camera is:


In most cases the P \approx Q. Therefore the above equation can be rewritten to:


Now we can control L by modulating the brightness of the light source. Thus we are left with a system of simultaneous equations:



These equations can be solved quite easily by using images of two different brightness. If we solve for P we will get the distance between the object and the camera.

Practical Implementation

In theory this all looks very good so how do we actually implement it? My hardware looks like this:imag0504

As you can see theres a blue LED, an Arduino and Logitech webcam. The arduino controls the brightness of the LED and the webcam observes. The code for the arduino is as follows:

I know theres an analog write function but I decided to be fancy about it and do the timing myself.

The app itself is available on github:

It is written in C++ using the Qt framework and OpenCV (I’ve hard coded the path to opencv, you’ll need to change it). Theres still a lot of work to be done to get it into production quality.

Here is my first scan:

You can see the object has been digitised quite well.

Here is another scan of my mom’s glasses and a tissue box in the backdrop. The actual scan was done in broad daylight.


The code isn’t perfect. It currently uses unsigned chars to deal with the computation. This leads to the problem of saturation. You can see this problem on the hinge of the glasses because of the proximity of the glasses. Another image which makes this problem even clearer is this one:


Here you can see that the parts nearest the light source and with the highest albedo are over saturated. This problem would be solved if I transitioned the code to floating point arithmetic as opposed to unsigned 8 byte integers.

µSpeech 4.0 coming soon with support

The µSpeech library is almost ready to be deployed as version 4.0. A number of bug fixes have been performed and new features have been added. Among the new features is a way to store and compare words. It also makes it significantly easier to calibrate the library. Among other things I have augmented the documentation with a video for calibrating the latest version of the library. If you are interested in trying it before hand:

git clone -b 4.0-workingBranch

UPDATE: 4.0 is now mainstream, just go to the downloads page to download the latest version.

This is how to calibrate the phoneme recognizer:

The new API Docs will be coming up soon. I am in the process of getting it updated at so those of you who use it to program your arduino can enjoy the benefits of µSpeech

Complete list of changes:

  • Update code bender support.
  • Fix vowel detection.
  • Package in Arduino IDE friendly format.
  • Video documentation.
  • Improve ease of use.
  • New API for easy word recognition

How µSpeech works

Its been over a year since I posted anything. This does not mean that I was not doing anything, but rather that I was working on other stuff (and also the Chinese government has blocked WordPress and I only got a VPN recently). One of the things that I had been working on was µSpeech, a speech recognition software for the Arduino. This had originally seemed crazy as speech recognition was a very computationally demanding process. Clocked at a couple of megahertz and with Kilobytes of RAM, the Arduino could not afford to use a standard speech recognition algorithm.

Most speech recognition algorithms involve the use of a process known as the Fast Fourier Transform (FFT). For those who are unaware, the FFT is a process which takes a sound and splits it into its constituent frequencies. Now, the FFT is not something that is particularly easy to do. In fact contrary to its name, it is an extremely slow process. The innovation in µSpeech is that it bypasses this process – at a cost: µSpeech is only able to differentiate between fricatives and voiced fricatives. Its ability is therefore limited, but it is good enough for being able to differentiate between commands such as “Left”, “right”, “Forward” and “Backward.”

The thing about fricatives (such as: /f/, /s/, /sh/) is that if you touch your throat, you realize that the vocal chords play no role in making these sounds. This means that these sounds are made entirely by the mouth and the air coming out of it. The key here is that this means that these sounds have an inherent tendency to be more like noise and have higher frequencies. If you were to look at a graph plotting the air pressure over time, the sounds of /s/ has a very chaotic graph that zigzags a lot which is not the case with the sound of /a/. Thus I found that the following formula works well:

\large{ c = \frac{\sum |\frac{df(x)}{dx}|}{\sum |f(x)|}}

Letters such as /s/ result in a very high value of c, where as letters such as /a/ result in a low value. Voiced fricatives such as /v/ result in a value that is just in between.

I found that the value for c falls within a certain range depending on the letter (and microphone). Thus when you calibrate µSpeech, you are essentially tweaking the threshold values. It generally takes a full afternoon to get them right!