My original goal in learning CUDA was to write a GPU-accelerated $O(n^2)$ gravitational force calculation. At UCSD, Physics 141 students have access to a computer with eight NVIDIA Titan GPUs, which is an insane amount of computing power. $O(n^2)$ algorithms scale badly: 400,000 particles means roughly $1.6 \times 10^{11}$ pairwise interactions per timestep. Even so, I was able to run a 400,000-particle gravitational attraction calculation in less than one second per timestep. That is blisteringly fast. It’s about on par with a single CPU core running tree code; that is to say, I would expect a single-core tree code to process 200k to 400k particles per second.
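To make the cost of the direct-sum approach concrete, here is a minimal sketch in plain Python rather than CUDA. It is not the kernel behind the timing quoted here. The function names `accelerations` and `pair_interactions`, the code-unit value of `G`, and the softening length `eps` are all assumptions made for illustration. The outer loop over `i` is the part a GPU version would typically hand out to one thread per particle.

```python
import math

G = 1.0           # gravitational constant in code units (assumed)
SOFTENING = 1e-3  # softening length to avoid division by zero (assumed)


def accelerations(positions, masses, g=G, eps=SOFTENING):
    """Direct-sum O(n^2) gravitational acceleration on every particle.

    positions: list of (x, y, z) tuples; masses: list of floats.
    Returns a list of [ax, ay, az] lists, one per particle.
    """
    n = len(positions)
    acc = []
    for i in range(n):  # on a GPU, each i would typically be its own thread
        xi, yi, zi = positions[i]
        ax = ay = az = 0.0
        for j in range(n):  # every particle sums over all the others
            if i == j:
                continue
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + eps * eps
            scale = g * masses[j] / (r2 * math.sqrt(r2))  # G * m_j / r^3
            ax += scale * dx
            ay += scale * dy
            az += scale * dz
        acc.append([ax, ay, az])
    return acc


def pair_interactions(n):
    """Number of ordered (i, j) pairs with i != j evaluated per timestep."""
    return n * (n - 1)
```

Calling `pair_interactions(400_000)` gives 159,999,600,000, which is where the figure of about $1.6 \times 10^{11}$ interactions per timestep comes from. A pure-Python loop like this is only practical for a few thousand particles, which is why the real calculation belongs on a GPU.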