We consider the problem of computing f_m(f_{m-1}(...f_1(x)...)) where each function can be broken up in pairs so that the computation at, e.g., indices k and l involve only the vales of the argument at positions k and l. That is, f_j(u)_k = f^+_j(u_k,u_l) and so on. This generalizes ``butterfly'' algorithms, such as Radix-2 for computing Fourier transforms. We demonstrate how to use a graphics Processing Unit (GPU), such as the Tesla C1060 with 240 cores, to perform a large number of executions of these algorithms efficiently in parallel. This has a general application among other things in the decoding of non-binary linear codes where the vectors are probability distributions over GF(256). Interestingly, in this case it appears bank-conflicts with shared memory cannot be completely avoided, yet may be reduced drastically by a non-trivial reordering of operations.