INSTALLING

You need to run

*sudo apt-get install python2.x-dev*

(I did 2.7) to get inline to work. When *scipy.weave.inline()* runs, there is a call to the C header 'Python.h', which is contained in the dev package.
If there are issues with building the C code, i.e. cryptic errors like *"Error Code -1 returned"*, adding *verbose=2* to the arguments of the *inline()* call will print every line of output as the code is built. This is very useful for figuring out why the inline build is breaking!
BACKGROUND

This code pretty much builds on my previous blog post about FFTW. I am a big fan of the speed and simplicity of Python, especially for interacting with the web. However, after my data is fetched from the web, I also want to perform some pretty intense analytics, and even NumPy left something to be desired. After reading this page on heavy computation in Python (http://www.scipy.org/PerformancePython), I found out that inline C can get pretty close to pure C functions, and by interfacing the PyObjects directly, disk I/O can be avoided, which may also save some compute time. And it is also much, much simpler (and faster to code) in Python than in pure C.

CODES OF NOTE

There are a couple of nice things about inlining these functions. Notice the *['comp', 'num_fft', 'fft_len', 'val']* inside the *scipy.weave.inline()* call. This exposes those Python objects to the C code, allowing them to be used there. I use *num_fft*, *fft_len*, and *val* to set loop sizes and initial values; then, after performing the FFTs, the values in *out* are written to *comp* using C array indexing. Because the numpy array is declared as complex, the real and imaginary components are combined and written in *(re, imag)* form.
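For intuition, the C indexing *comp[j + i*fft_len]* is exactly row-major (i, j) indexing of a (num_fft, fft_len) grid, with the complex values written in (re, imag) form. A tiny pure-NumPy sketch (the small sizes here are made up for illustration):

```python
import numpy as np

num_fft, fft_len = 3, 4          # tiny sizes, just for illustration
comp = np.zeros(num_fft * fft_len, dtype=complex)

val = 0.0
for i in range(num_fft):         # mirrors the C fill loop
    val += 1
    for j in range(fft_len):
        comp[j + i * fft_len] = complex(val, val)   # (re, imag) form

# The flat index j + i*fft_len corresponds to row-major (i, j) indexing:
grid = comp.reshape(num_fft, fft_len)
print(grid[1, 2])   # → (2+2j)
```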
FINAL THOUGHTS

Here are the statistics I get running two chunks of code against each other, each using fftw_plan_many_dft to do 100,000 length-32 FFTs, with the planner flag set to FFTW_ESTIMATE.

*Python*

$ time ./fftw_plan_many.py
real 0m0.738s
user 0m0.416s
sys 0m0.192s

*C*

$ time ./many_dft.o
real 0m0.188s
user 0m0.064s
sys 0m0.088s

Not bad: 100,000 FFTs, albeit small ones (32 bins), in under a second. However, C (like Sebulba) always wins in numerics.
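For a third point of reference (not the FFTW path used above, so the numbers are not directly comparable), the same batch shape can be timed in pure NumPy, which also does all the transforms in one call:

```python
import timeit
import numpy as np

num_fft, fft_len = 100000, 32
data = np.ones((num_fft, fft_len), dtype=complex)

# np.fft.fft along axis=1 performs all 100,000 length-32 transforms in
# one call, analogous to fftw_plan_many_dft batching them under one plan.
t = timeit.timeit(lambda: np.fft.fft(data, axis=1), number=1)
print("numpy batched FFT: %.3f s" % t)
```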

I am planning to continue this investigation thread at a later date. Once my NVIDIA development computer is up and running, I will likely attempt to do CUDAFFT with this same interface, and compare that to the current findings.

SIMPLE CODE

#!/usr/bin/python
from scipy.weave import inline

# Inline C snippet: printf runs from the compiled code
inline( 'printf("Hello World");',
        headers=['<stdio.h>'],
        compiler='gcc' )
print ""

PYTHON FFTW PLAN MANY CODE

#!/usr/bin/python
import numpy as np
from scipy.weave import inline
from numpy.distutils.system_info import get_info

fftw_plan_many_dft = \
"""
fftw_complex* in;
fftw_complex* out;
fftw_plan plan;

in  = (fftw_complex *) fftw_malloc(sizeof(fftw_complex)*fft_len*num_fft);
out = (fftw_complex *) fftw_malloc(sizeof(fftw_complex)*fft_len*num_fft);

/* Each transform occupies a contiguous block of fft_len elements
   (matching the fill loop below), so stride is 1 and the distance
   between consecutive transforms is fft_len. */
plan = fftw_plan_many_dft( 1,            /* rank */
                           &fft_len,     /* n */
                           num_fft,      /* howmany */
                           in,
                           NULL,         /* inembed */
                           1,            /* istride */
                           fft_len,      /* idist */
                           out,
                           NULL,         /* onembed */
                           1,            /* ostride */
                           fft_len,      /* odist */
                           FFTW_FORWARD,
                           FFTW_ESTIMATE );

for (int i = 0; i < num_fft; i++){
    val += 1;
    for (int j = 0; j < fft_len; j++){
        in[j+i*fft_len][0] = val;
        in[j+i*fft_len][1] = val;
    }
}

fftw_execute(plan);

/* FFTW's transform is unnormalized, so divide by fft_len on the way out */
for (int i = 0; i < num_fft; i++){
    for (int j = 0; j < fft_len; j++){
        comp[j+i*fft_len] = std::complex<double>(out[j+i*fft_len][0]/fft_len,
                                                 out[j+i*fft_len][1]/fft_len);
    }
}

fftw_destroy_plan(plan);
fftw_free(in);
fftw_free(out);
"""

fftw_info = get_info('fftw3')

num_fft = 100000
fft_len = 32
val = 0.

comp = np.array(np.zeros(num_fft*fft_len), dtype=complex)

inline( fftw_plan_many_dft,
        ['comp', 'num_fft', 'fft_len', 'val'],
        headers=['<fftw3.h>'],
        libraries=['fftw3'],
        include_dirs=fftw_info['include_dirs'],
        library_dirs=fftw_info['library_dirs'],
        compiler='gcc' )

print comp
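As a sanity check on *comp* without recompiling anything, the same fill pattern and the same divide-by-fft_len normalization can be reproduced in pure NumPy (np.fft.fft is also unnormalized, like FFTW's forward transform). Since each row is a constant, all the energy should land in bin 0:

```python
import numpy as np

num_fft, fft_len = 100000, 32
val = 0.0
inp = np.empty((num_fft, fft_len), dtype=complex)
for i in range(num_fft):
    val += 1
    inp[i, :] = val + 1j * val          # same constant fill as the C loop

# Unnormalized FFT divided by fft_len, matching the inline code's scaling
expected = np.fft.fft(inp, axis=1) / fft_len

# Row i is constant, so bin 0 recovers the input value and all other bins are ~0
print(expected[0, 0])   # ≈ (1+1j)
```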