I would like to port Stephen Milborrow's implementation of Jerome Friedman's MARS (Multivariate Adaptive Regression Splines) algorithm.
It's written in C (there are some optional R add-ons, which I don't need, and some BLAS add-ons, which would be nice to keep), and I would like to port it to CUDA C (CUDA 5.0). The end result should be a kernel that runs on a single GPU thread block. After our contract concludes, I will incorporate the kernel into a program that launches several kernels to process different datasets concurrently. The port seems pretty straightforward, but I don't understand the CUDA framework well enough to do it quickly and efficiently myself. The work requires experience with CUDA and C. I think it can be ported without knowing machine learning, but that knowledge can't hurt. Please don't hesitate to ask any questions. Thank you.