How µSpeech works

Its been over a year since I posted anything. This does not mean that I was not doing anything, but rather that I was working on other stuff (and also the Chinese government has blocked WordPress and I only got a VPN recently). One of the things that I had been working on was µSpeech, a speech recognition software for the Arduino. This had originally seemed crazy as speech recognition was a very computationally demanding process. Clocked at a couple of megahertz and with Kilobytes of RAM, the Arduino could not afford to use a standard speech recognition algorithm.

Most speech recognition algorithms involve the use of a process known as the Fast Fourier Transform (FFT). For those who are unaware, the FFT is a process which takes a sound and splits it into its constituent frequencies. Now, the FFT is not something that is particularly easy to do. In fact contrary to its name, it is an extremely slow process. The innovation in µSpeech is that it bypasses this process – at a cost: µSpeech is only able to differentiate between fricatives and voiced fricatives. Its ability is therefore limited, but it is good enough for being able to differentiate between commands such as “Left”, “right”, “Forward” and “Backward.”

The thing about fricatives (such as: /f/, /s/, /sh/) is that if you touch your throat, you realize that the vocal chords play no role in making these sounds. This means that these sounds are made entirely by the mouth and the air coming out of it. The key here is that this means that these sounds have an inherent tendency to be more like noise and have higher frequencies. If you were to look at a graph plotting the air pressure over time, the sounds of /s/ has a very chaotic graph that zigzags a lot which is not the case with the sound of /a/. Thus I found that the following formula works well:

\large{ c = \frac{\sum |\frac{df(x)}{dx}|}{\sum |f(x)|}}

Letters such as /s/ result in a very high value of c, where as letters such as /a/ result in a low value. Voiced fricatives such as /v/ result in a value that is just in between.

I found that the value for c falls within a certain range depending on the letter (and microphone). Thus when you calibrate µSpeech, you are essentially tweaking the threshold values. It generally takes a full afternoon to get them right!