My goal in learning CUDA originally was to write a GPU-accelerated $O(n^2)$ gravitational force calculation algorithm. At UCSD, Physics 141 students have access to a computer with eight NVidia Titan GPUs. That is an insane amount of computing power. $O(n^2)$ algorithms are very slow, but I was able to run a 400,000 particle gravitational attraction calculation with less than one second per timestep. That is absolutely insane, blistering fast performance. It’s about on-par with a single CPU core running tree code – that is to say, I would expect a single core tree code to process 200k to 400k particles per second.

My point is that you can get pretty dang far with the brute force, “dumb” algorithm, once it’s running in a massively parallel environment. Don’t knock it ‘til you’ve tried it. Here’s how you can go from 0 CUDA knowledge to a working multi-GPU, massively parallel, brute force gravitational algorithm, in six simple steps! There is even experimental evidence that this method works. I did it myself :)

  1. Get your hands on a copy of CUDA By Example, by Sanders and Kandrot. Find it online, buy it on amazon, or if you’re a UCSD student, recall and check it out from the library. (“I don’t care how you get it, buy one, hunt one down with a knife, but you treat it with respect.”)
  2. Get your hands on the source code for the examples in CUDA By Example. It’s free for everyone, available on the NVidia website. Also, make note of where, exactly, your CUDA/Samples/5_Simulations/nbody/ folder is. If you have CUDA installed, it should be somewhere on your hard drive. You should, at some point, compile the example too.
  3. Go through chapters 1 through 4, editing, modifying, and compiling and running all of the code. I ran the examples on my Ubuntu home computer, and I recommend you run it on some linux computer (if you want to use the UCSD Titan GPU computer, you’ll have to be familiar with the command line, SSH, and linux compiling!)
  4. By now you should be able to write GPU code, and view and manage different computing devices. Onwards to gravity! Look up the free, online chapter of the book GPU Gems 3, Fast N-Body Simulation with CUDA. Yes, “fast” means $O(n^2)$. Make sure you can find and compile that 5_Simulations/nbody/ sample code I mentioned earlier.
  5. Now we’re in the woods a bit. At this step, I compared the GPU Gems code and the sample code (which isn’t very readable but can be useful), and deleting and adding swathes, until I got a single device, massively parallel GPU code that was all my own. That is, it was like the GPU Gems code, but it was a full/compilable demonstration, and it didn’t have any of the fluff that the CUDA library sample code had. This was the hardest step. Once you have a single GPU program working, the rest is easy.
  6. The last step is to make it a multi-GPU program. This is very easy. Chapter 11 of CUDA By Example is titled “CUDA C On Multiple GPUs”, but you DON’T need this chapter. You need no communication between GPUs. To run code on multiple devices at once, just switch devices, and start kernels on each, each with a different set of data. Each device gets a copy of the location of all $n$ particles, but only updates the forces felt by some range $[n_1,n_2)$ of particles. Once the calculation is done, combine the lists, ignoring the duplicate data. Don’t worry about the added workload of a few tens of megabytes (or more) of duplicate data in memory. We’re dealing with a big computational problem, and a few milliseconds is totally insignificant.

A note for students taking Physics 141 at UCSD

Keep in mind that along with the gravitational force calculation, you have to:

  1. Generate and load a galaxy.
  2. Render the particles.
  3. Do actual physics.
  4. Write a report.

Generating a galaxy is a pretty big undertaking. Plummer spheres won’t cut it any more! Rendering is a big undertaking: you can save particle positions in a text file, load it into python or mathematica, and render it there, but keep in mind that for 400k particles this will take up a lot of time and hard drive space and RAM. Most python/mathematica graphics commands weren’t meant for 400k particles. Doing actual physics (interpreting units correctly, getting a scientific result) is a sort of big undertaking, but sometimes you can do the physics after your simulation. The report isn’t too difficult, but it’s still something.

A note on “interpreting the units correctly”

I mean by this that if you run your simulation without giving any thought to the values of $dt$ in seconds, $x$ in meters, the mass of each particle in kilograms, and the gravitational constant $G$ in whatever its units are, you can use the exact same data and reinterpret the time/length scales. For example, a pair of orbiting points might be two dice a light-year apart which take $10^{24}$ years to orbit, or it might be a pair of black holes which orbit in a day. If you’re clever about it, the same “simulation data” which places the points at $\vec{x}=\pm (\cos(t),\sin(t))$ can tell you all the properties of both systems, just with different units for $x$ and $t$.

To work through the math of that, I found the following interpretation useful. Interpret the computer-handled values $x$, $t$, $m$, $G$ as unitless (they’re just scalars in memory). The actual, unitful values (denoted $\bar{x}$, $\bar{t}$, $\bar{m}$, $\bar{G}$) are the unitless values times some unitful scales: $\bar{x}=x \sigma$, $\bar{t}=t \tau$, $\bar{m}=\mu m$, $\bar{G}=g G$. The computerized values satisfy the following given equation:

The actual physical values satisfy the same equation with bars over everything. Replace the barred values with their unbarred ones times the units, and divide both sides of the equation by the unitless version to get:

This gives you a physical relation between your length, time, and mass scales, which you can use to reinterpret results as in the black holes/dice example above.

Source code

I will try to upload my source code to this project. If it’s been a week, you’re interested, and the source code is still not up, please email me.