https://bitbucket.org/bradjcox/gpu-maven-plugin. Download and compile it from there with "mvn install"; I haven't published it to public repositories yet.
The GPU Maven Plugin compiles Java code with hand-selected Java kernels to CUDA that can run on any NVIDIA GPU of compute capability 2.0 or higher. It encapsulates the build process so that compiling GPU code is as easy as compiling ordinary Java code with Maven. The plugin relies on the NVIDIA CUDA software, which must be installed separately.
The plugin source includes forks of Rootbeer and Soot with no modifications except essential bug repairs. Their author is attached to command-line tools and idiosyncratic build conventions and I couldn't wait for him any longer.
How it works
You write ordinary Java that designates code to run on the GPU by enclosing it in a class that implements the Rootbeer "Kernel" interface. You use ordinary Java compilers to compile this into a jar of byte codes that the Java virtual machine can run. This plugin steps in at that point to turn the original jar into a new jar that contains CUDA kernels that will run on the GPU when requested to do so by the non-kernel parts of the program. Only the kernels are converted to CUDA; the rest of the program remains as Java byte codes.
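As a rough illustration of the pattern described above, here is a minimal sketch of a kernel class and the ordinary Java that drives it. The nested Kernel interface is a local stand-in, assuming the Rootbeer interface declares a single no-argument gpuMethod(); SquareKernel, KernelSketch and squareAll are hypothetical names that do not come from the plugin. The kernels run sequentially on the CPU here, so the sketch runs without Rootbeer or a GPU.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class KernelSketch {
    // Local stand-in for Rootbeer's Kernel interface (assumption: one
    // no-argument gpuMethod()). Real code would import the Rootbeer type.
    interface Kernel {
        void gpuMethod();
    }

    // Hypothetical kernel: each instance squares one element of a shared array.
    static class SquareKernel implements Kernel {
        private final int[] data;
        private final int index;

        SquareKernel(int[] data, int index) {
            this.data = data;
            this.index = index;
        }

        @Override
        public void gpuMethod() {
            // Only this method body is the part that would be converted to CUDA.
            data[index] = data[index] * data[index];
        }
    }

    // Non-kernel code: builds one kernel per element and runs them all.
    // The loop below is a sequential CPU stand-in for the GPU dispatch.
    static int[] squareAll(int[] input) {
        int[] data = input.clone();
        List<Kernel> jobs = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            jobs.add(new SquareKernel(data, i));
        }
        for (Kernel job : jobs) {
            job.gpuMethod();
        }
        return data;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(squareAll(new int[] {1, 2, 3, 4})));
    }
}
```

The point of the design is that the kernel is plain Java with its work confined to one method, so the same source compiles and runs on the JVM before the plugin ever touches the jar.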
Byte code is a stack-based format that is good for execution but not for the code analysis and translation steps that follow. Rootbeer uses Soot to find Kernel classes in the jar, to locate their dependencies and to translate them to Jimple, a 3-address format that Rootbeer translates into CUDA-compatible C++ source code. Finally, the NVIDIA tool chain compiles the generated source code to CUDA binaries and links them into a binary kernel that the original Java can run on the GPU.
The plugin handles these steps automatically so the build process looks like an ordinary Java compile to its users.
How to use it
See the gpu-timings folder for example applications with poms that show how to compile them. See gpu-rootbeer/docs for details.
- gpu-mandelbrot: A Java mandelbrot generator based on many CPU threads.
- gpu-mandelbrot-gpu: gpu-mandelbrot modified to run each thread as a GPU thread. The goal was to compare performance but this step has not been completed.
- gpu-timings: Several common algorithms instrumented to compare CPU-only versus GPU performance. Average computes the average of arrays of varying sizes. SumSq computes the sum of the squares. IntMatrixMultiply and DoubleMatrixMultiply multiply two matrices of varying sizes.
Is it worth it?
It depends on your application, in particular on the number of GPU tasks and the amount of work each one does in parallel. Transferring data to and from the GPU also carries significant but so far unmeasured costs.
For example, the gpu-timings/Average application computes the average of large arrays by subdividing the array into chunks, assigning a GPU task to sum each chunk, and computing the average when the tasks are done. Tentative conclusions are that conversion of hand-designated Java kernels to GPU/CUDA becomes beneficial at about a thousand threads (10^3), each processing ten thousand values (10^4) in parallel. The improvement is 2x-4x at those levels and grows to 37.2x for 10^5 tasks and 10^6 values.
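The chunking scheme described above can be sketched as follows. This is not the gpu-timings/Average source; it is a minimal illustration under stated assumptions: the nested Kernel interface is a local stand-in for the Rootbeer one, and AverageSketch, ChunkSum and average are hypothetical names. The tasks run sequentially on the CPU, so the sketch shows only how the work is divided and recombined, not the GPU dispatch or its timing.

```java
import java.util.ArrayList;
import java.util.List;

class AverageSketch {
    // Local stand-in for Rootbeer's Kernel interface (assumption).
    interface Kernel {
        void gpuMethod();
    }

    // Hypothetical task: sums values[from, to) and keeps the partial sum.
    static class ChunkSum implements Kernel {
        private final double[] values;
        private final int from;
        private final int to;
        double sum;

        ChunkSum(double[] values, int from, int to) {
            this.values = values;
            this.from = from;
            this.to = to;
        }

        @Override
        public void gpuMethod() {
            double s = 0.0;
            for (int i = from; i < to; i++) {
                s += values[i];
            }
            sum = s;
        }
    }

    // Splits the array into at most `tasks` chunks, sums each chunk in its
    // own task, then combines the partial sums into the average.
    static double average(double[] values, int tasks) {
        if (values.length == 0 || tasks < 1) {
            throw new IllegalArgumentException("need a non-empty array and at least one task");
        }
        int chunk = (values.length + tasks - 1) / tasks; // ceiling division
        List<ChunkSum> jobs = new ArrayList<>();
        for (int from = 0; from < values.length; from += chunk) {
            jobs.add(new ChunkSum(values, from, Math.min(from + chunk, values.length)));
        }
        for (ChunkSum job : jobs) {
            job.gpuMethod(); // sequential CPU stand-in for the parallel GPU run
        }
        double total = 0.0;
        for (ChunkSum job : jobs) {
            total += job.sum;
        }
        return total / values.length;
    }

    public static void main(String[] args) {
        double[] values = new double[1000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i;
        }
        System.out.println(average(values, 8));
    }
}
```

Each task reads a disjoint slice and writes only its own partial sum, so the tasks need no coordination until the final combining step on the host.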