sness: Experiments with FFTs

Thursday, July 09, 2009

I'm writing a new Flash interaction tool for The Orchive. The original one was fine and played audio, but since the orca recordings are all 45 minutes long, it was impossible to navigate through a recording to find the orca calls.

So, I wrote orcaannotator, a new Flash tool written with haXe that lets you zoom right down to individual calls, which last only a second or two.

In doing so, I needed to generate huge images of the waveforms and spectrograms. These are then chopped up by a tool written in Ruby using ImageMagick and handed to the Flash application.
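As a rough idea of what the chopping step involves, here is a minimal sketch in Python (not the actual Ruby tool) that works out the crop rectangles for slicing one very wide image into fixed-width tiles. The function name and the tile width are made up for illustration; the strings it returns use ImageMagick's WxH+X+Y geometry form, which is what the -crop option accepts.

```python
def tile_geometries(image_width, image_height, tile_width):
    """Return ImageMagick-style crop geometries (WxH+X+Y), left to right.

    Hypothetical helper: the tiles are full-height vertical strips, and the
    last one is narrower when the width is not a multiple of tile_width.
    """
    geometries = []
    for x in range(0, image_width, tile_width):
        width = min(tile_width, image_width - x)
        geometries.append(f"{width}x{image_height}+{x}+0")
    return geometries


# Example: a 2500-pixel-wide strip cut into 1000-pixel tiles.
tiles = tile_geometries(2500, 160, 1000)
```

Each geometry could then be passed to something like convert big.png -crop GEOMETRY tile_N.png, one tile at a time.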

To actually generate the images, I used Marsyas and wrote a program called sound2png that takes an audio file and outputs a PNG containing a waveform or a spectrogram.

I had lots of problems with the spectrogram code: I just couldn't get the spectrograms to display like they do in Audacity. I tried all sorts of different scaling algorithms and searched all around the web for different contrast curve algorithms, and nothing that I tried worked.

My super smart Mom suggested that I look in the source code of Audacity. I'd looked in it a bit before and given up, but buoyed by her suggestion, I tried again. The code was actually much easier to understand than I'd remembered, and I was quickly able to find the code that did this, first by looking in WaveClip.h at the WaveClip::GetSpectrogram() function, which led me to Spectrum::ComputeSpectrum().

It turns out that they weren't doing anything fancy, just taking the decibels (10*log10()) of the Power Spectrum. This was a bit strange, because that was just what I thought I was doing in Marsyas. I looked a bit more closely at the Marsyas code, and realized that it was actually doing 20*log10() and cutting off anything that was below -100.

This was totally my problem: all the low-end data was getting truncated, and no amount of post-scaling could fix it! I fixed Marsyas, and am now generating sweet-looking spectrograms.
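To see why the cutoff hurts so much, here is a toy sketch in Python. It is not the Marsyas or Audacity code: the function names and the sample power values are made up, and it assumes both formulas are applied to the same power value. It only illustrates how 20*log10() with a floor at -100 flattens quiet bins that 10*log10() keeps distinct.

```python
import math


def power_to_db(power):
    """Plain decibels of a power value: 10*log10(power)."""
    return 10.0 * math.log10(power)


def doubled_and_floored_db(power, floor_db=-100.0):
    """20*log10 of the same value, clamped at floor_db (the behaviour described above)."""
    return max(20.0 * math.log10(power), floor_db)


# Hypothetical quiet power-spectrum bins, each 100x weaker than the last.
quiet_bins = [1e-4, 1e-6, 1e-8, 1e-10]

plain = [round(power_to_db(p), 6) for p in quiet_bins]
clipped = [round(doubled_and_floored_db(p), 6) for p in quiet_bins]

# With 10*log10 every bin keeps its own level; with the doubled scale and
# the floor, everything quieter than the first bin lands on the same value.
distinct_plain = len(set(plain))
distinct_clipped = len(set(clipped))
```

Once several bins have been clamped to the same value, no contrast curve applied afterwards can tell them apart again, which matches what I was seeing.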

Right now, I'm just trying to figure out the best window size to use for my spectrograms. I chopped out 45 seconds of orca song and tried a number of different window sizes on it. Here's the command I used:


sound2png small.wav -mf 10000 -v -ws 2048 -hs 1024 out.png

This gives a maximum frequency of 10,000 Hz, a window size of 2048 and a hop size of 1024. The window size of 2048 gives back a Power Spectrum with 1024 bins, and the hop size of 1024 means that consecutive windows overlap by half their width.
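The arithmetic behind those numbers can be sketched in a few lines of Python. This is just a back-of-the-envelope helper, not part of sound2png: the function name is made up, and the 44,100 Hz sample rate is an assumption, since I haven't said what rate small.wav actually uses.

```python
def spectrogram_layout(num_samples, sample_rate, window_size, hop_size, max_freq):
    """Work out the basic shape of a spectrogram from its analysis parameters."""
    hz_per_bin = sample_rate / window_size
    return {
        # Bin count as counted above; many FFT libraries report
        # window_size // 2 + 1 because they include the Nyquist bin.
        "bins": window_size // 2,
        # Fraction of each window shared with the next one.
        "overlap": 1.0 - hop_size / window_size,
        # Frequency resolution: width of one bin in Hz.
        "hz_per_bin": hz_per_bin,
        # Number of whole bins that fit below the maximum frequency.
        "bins_below_max_freq": int(max_freq // hz_per_bin),
        # Time resolution: seconds between successive columns.
        "seconds_per_hop": hop_size / sample_rate,
        # Full windows that fit in the clip (no padding at the end).
        "frames": 1 + max(0, num_samples - window_size) // hop_size,
    }


# 45 seconds at an assumed 44,100 Hz, with the parameters from the command.
layout = spectrogram_layout(45 * 44100, 44100, 2048, 1024, 10000)
```

Doubling the window size halves hz_per_bin (finer frequency detail) but doubles seconds_per_hop when the hop stays at half the window (coarser time detail), which is the trade-off being explored here.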

I tried this with a number of different window sizes to get a bunch of images, which I then scaled to the same size with the following ImageMagick command:


convert -resize "989x160\!" out_2048_1024_num2.png out_2048_1024_num2.jpg
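Running this by hand for every window size gets tedious, so here is a minimal Python sketch that builds both command lines for each size. The flags are the ones shown in the commands above; the function name, the output file naming, and the choice of a hop size equal to half the window for every size are assumptions for illustration.

```python
def sweep_commands(wav_path, window_sizes, max_freq=10000, size="989x160!"):
    """Build sound2png and convert argument lists for each window size.

    Assumption: the hop size is always half the window size, as in the
    2048/1024 example. Nothing is executed here; the lists are only built.
    """
    commands = []
    for window_size in window_sizes:
        hop_size = window_size // 2
        png = f"out_{window_size}_{hop_size}.png"
        jpg = f"out_{window_size}_{hop_size}.jpg"
        commands.append(["sound2png", wav_path, "-mf", str(max_freq), "-v",
                         "-ws", str(window_size), "-hs", str(hop_size), png])
        # As an argument list the "!" needs no backslash, because no shell
        # is involved to interpret it.
        commands.append(["convert", "-resize", size, png, jpg])
    return commands


commands = sweep_commands("small.wav", [512, 1024, 2048, 4096, 8192])
```

Each entry can be handed to subprocess.run(cmd, check=True) to actually run it, assuming sound2png and convert are on the PATH.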

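Window size	Image
512	(spectrogram image)
1024	(spectrogram image)
2048	(spectrogram image)
4096	(spectrogram image)
8192	(spectrogram image)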

I think that 2048 gives the best image...