We consider the problem of computing
f_m(f_{m-1}(...f_1(x)...)) where each function can be broken up in pairs so that the computation at, e.g., 
indices k and l involve only the vales of the argument at positions k and l. That is, f_j(u)_k = f^+_j(u_k,u_l) and so on.
This generalizes ``butterfly'' algorithms, such as Radix-2 for computing Fourier transforms.

We demonstrate how to use a graphics Processing Unit (GPU), such as the Tesla C1060 with 240 cores, 
to perform a large number of executions of these algorithms efficiently in parallel. 
This has a general application among other things in the decoding of non-binary linear codes where 
the vectors are probability distributions over GF(256). Interestingly, in this case it appears bank-conflicts 
with shared memory cannot be completely avoided, yet may be reduced drastically by a non-trivial reordering of operations.