# DIY 3D Scanner Via Inverse Square Law

When you look around the market, there are many different types of 3D scanners and 3D sensors out there. A quick look at PCL’s IO library lists a number of different sensors out there.  Microsoft has their Kinect which use Time Of Flight to measure depth. The PR2 robot uses projected stereo mapping (i.e. It uses a projector to project a pattern onto the floor then it uses visual cues from the projected pattern to reconstruct the depth of the pattern). Then there are laser range finders which use rotating mirrors to scan depth. Some sensors use multiple camera like we use our two eyes to estimate depth. All of these methods have their own problems. Measuring time of flight is challenging and expensive, rotating mirrors with laser are slow and unsafe, stereoscopic methods are still quite inaccurate. Enter 3D Scanning Via Inverse Square Law. This is a new technique that I have hacked together using one LED and a camera. It is somewhat like projected stereo mapping but simpler and less computationally expensive.

### Theory

So how does it work? In a traditional time of flight imaging system a pulse of light is sent out and the time it takes for the reflection to come back is used to gauge the distance. Light is very fast so the time difference for most objects is so minuscule that it makes the electronics very complicated. There is something else that changes with time over a distance – Power. The further we move away from a light source the less light we receive. This is characterised by the inverse square law.

$P_{detected} = \frac{P_{source}}{r^2}$

Since power is very easy to measure, in theory it should be quite easy to measure the distance from the source using this technique. The challenge arises because the brightness of the source is not constant. Not all materials reflect light in the same way. If we were to imagine a set up like the one below:

Let us assume that the light source emits light at a power of $L$.

Then the object receives $\frac{L}{P^2}$  amount of light.

If its albedo is $a$ then the light reflected back is:

$\frac{aL}{P^2}$

According to the inverse square law the light that reaches the camera is:

$\frac{aL}{Q^2P^2}$

In most cases the $P \approx Q$. Therefore the above equation can be rewritten to:

$\frac{aL}{P^4}$

Now we can control $L$ by modulating the brightness of the light source. Thus we are left with a system of simultaneous equations:

$\frac{aL_1}{P^4}$

$\frac{aL_2}{P^4}$

These equations can be solved quite easily by using images of two different brightness. If we solve for $P$ we will get the distance between the object and the camera.

### Practical Implementation

In theory this all looks very good so how do we actually implement it? My hardware looks like this:

As you can see theres a blue LED, an Arduino and Logitech webcam. The arduino controls the brightness of the LED and the webcam observes. The code for the arduino is as follows:

https://create.arduino.cc/editor/arjo129/b42a42e1-00fd-499b-bd9f-c1915d9eea61/preview?embed

I know theres an analog write function but I decided to be fancy about it and do the timing myself.

The app itself is available on github: https://github.com/arjo129/LightModulated3DScan

It is written in C++ using the Qt framework and OpenCV (I’ve hard coded the path to opencv, you’ll need to change it). Theres still a lot of work to be done to get it into production quality.

Here is my first scan:

You can see the object has been digitised quite well.

Here is another scan of my mom’s glasses and a tissue box in the backdrop. The actual scan was done in broad daylight.

The code isn’t perfect. It currently uses unsigned chars to deal with the computation. This leads to the problem of saturation. You can see this problem on the hinge of the glasses because of the proximity of the glasses. Another image which makes this problem even clearer is this one:

Here you can see that the parts nearest the light source and with the highest albedo are over saturated. This problem would be solved if I transitioned the code to floating point arithmetic as opposed to unsigned 8 byte integers.

# How µSpeech works

Its been over a year since I posted anything. This does not mean that I was not doing anything, but rather that I was working on other stuff (and also the Chinese government has blocked WordPress and I only got a VPN recently). One of the things that I had been working on was µSpeech, a speech recognition software for the Arduino. This had originally seemed crazy as speech recognition was a very computationally demanding process. Clocked at a couple of megahertz and with Kilobytes of RAM, the Arduino could not afford to use a standard speech recognition algorithm.

Most speech recognition algorithms involve the use of a process known as the Fast Fourier Transform (FFT). For those who are unaware, the FFT is a process which takes a sound and splits it into its constituent frequencies. Now, the FFT is not something that is particularly easy to do. In fact contrary to its name, it is an extremely slow process. The innovation in µSpeech is that it bypasses this process – at a cost: µSpeech is only able to differentiate between fricatives and voiced fricatives. Its ability is therefore limited, but it is good enough for being able to differentiate between commands such as “Left”, “right”, “Forward” and “Backward.”

The thing about fricatives (such as: /f/, /s/, /sh/) is that if you touch your throat, you realize that the vocal chords play no role in making these sounds. This means that these sounds are made entirely by the mouth and the air coming out of it. The key here is that this means that these sounds have an inherent tendency to be more like noise and have higher frequencies. If you were to look at a graph plotting the air pressure over time, the sounds of /s/ has a very chaotic graph that zigzags a lot which is not the case with the sound of /a/. Thus I found that the following formula works well:

$\large{ c = \frac{\sum |\frac{df(x)}{dx}|}{\sum |f(x)|}}$

Letters such as /s/ result in a very high value of c, where as letters such as /a/ result in a low value. Voiced fricatives such as /v/ result in a value that is just in between.

I found that the value for c falls within a certain range depending on the letter (and microphone). Thus when you calibrate µSpeech, you are essentially tweaking the threshold values. It generally takes a full afternoon to get them right!