Wednesday, October 26, 2011

Using Scipy To Do Inline FFTs

INSTALLING
You need to do a sudo apt-get install python2.x-dev to get inline to work (I did 2.7). When scipy.weave.inline() runs, there is a call to the C header 'Python.h' which is contained in the dev package.

If there are issues with building the C code, i.e. cryptic errors like "Error Code -1 returned", adding verbose=2 to the arguments of the inline() call will show every line output as it's being built - this is very useful for figuring out why the inline build is breaking!

BACKGROUND
This code pretty much builds on my previous blog post about FFTW. I am a big fan of the speed and simplicity of Python, especially for interacting with the web. However, after my data is fetched from the web, I also want to perform some pretty intense analytics, and even NumPy left something to be desired. After reading this page on heavy computation in Python (http://www.scipy.org/PerformancePython), I found out that inline C can get pretty close to pure C functions, and by interfacing the PyObjects directly, disk I/O can be avoided, which may also save some compute time. And it is also much, much simpler (and faster to code) in Python instead of pure C.

CODES OF NOTE
There are a couple of nice things about inlining these functions. Notice the ['comp', 'num_fft', 'fft_len', 'val'] inside the scipy.weave.inline() call. This exposes those Python objects to the C code, allowing them to be used. I use num_fftlen_fft, and val to set loop sizes and initial values, then after performing the FFTs, the values in out are written to comp using C array indexing. Because the numpy array is declared as complex, the real and imaginary components are combined and written using (re,imag) form.

FINAL THOUGHTS
Here are the statistics I get running two chunks of code against each other, each using dft_plan_many to do 100,000 length 32 FFTs, with FFTW_PLAN set to FFTW_ESTIMATE.

Python
$time ./fftw_plan_many.py real 0m0.738s user 0m0.416s sys 0m0.192s C$ time ./many_dft.o
real 0m0.188s
user 0m0.064s
sys 0m0.088s

Not bad to do 100,000 FFTs, albeit small ones (32 bin) in < 1 sec. However C (like Sebulba) always wins in numerics.

I am planning to continue this investigation thread at a later date. Once my nVidia development computer gets up and running, I will likely attempt to do CUDAFFT with this same interface, and compare that to the current findings.

SIMPLE CODE
#!/usr/bin/python
from scipy.weave import inline

inline( 'printf("Hello World");',
compiler='gcc')
print ""

PYTHON FFTW PLAN MANY CODE
#!/usr/bin/python
import numpy as np
from scipy.weave import inline
from numpy.distutils.system_info import get_info
fftw_plan_many_dft = \
"""
fftw_complex* in;
fftw_complex* out;
fftw_plan plan;

in = (fftw_complex *) fftw_malloc(sizeof(fftw_complex)*fft_len*num_fft);
out = (fftw_complex *) fftw_malloc(sizeof(fftw_complex)*fft_len*num_fft);

plan = fftw_plan_many_dft( 1,
&fft_len,
num_fft,
in,
NULL,
num_fft,
1,
out,
NULL,
num_fft,
1,
FFTW_FORWARD,
FFTW_ESTIMATE );

for (int i = 0; i < num_fft; i++){
val+=1;
for (int j = 0; j < fft_len; j++) {
in[j+i*fft_len][0] = val;
in[j+i*fft_len][1] = val ;
}
}
fftw_execute(plan);
for (int i =0; i < num_fft; i++){
for (int j = 0; j < fft_len; j++){
comp[j+i*fft_len] = std::complex<double>(out[j+i*fft_len][0]/fft_len, out[j+i*fft_len][1]/fft_len);
}
}
fftw_destroy_plan(plan);
"""

fftw_info = get_info( 'fftw3' )
num_fft = 100000
fft_len = 32
val = 0.
comp = np.array(np.zeros(num_fft*fft_len), dtype=complex)
inline( fftw_plan_many_dft,
['comp', 'num_fft', 'fft_len', 'val'],