NOTE: This post is currently WRONG. I am not correctly comparing every particle to every other particle in the gravity calculations. So, consider the listed results and speeds as incorrect. I will update this post once I get a correct working model going.

OpenCL allows you to write programs that run on your video card processor chip (GPU). GPUs can have many hundreds of cores so can process many more things simultaneously than a CPU can. A good candidate for parallel processing is a gravity simulation.

__Previous Results__

Since I originally posted the following 3D gravity movies to YouTube…

…there were questions and some skepticism in the comments so hopefully this blog post helps clarify things.

__Coding Information__

It has been a while since I was working with the gravity code, but this should be enough info for those interested.

This is the OpenCL kernel code that does the calculation for every object each frame. I had to screen capture it as WordPress kept deleting lines no matter what tags I used.

The kernel uses the standard Newtonian gravitation formulas. No other fancy calculations are included. So no energy conservation, no collision detection, no objects clustering into planets, no dark matter, no dissipation, no energy conservation, no overall angular momentum constraint, no speed-of-light limit on propagation of gravity taken into account and no inertia of mass. Basically it is the simplest possible formulas that give gravity like results.

The pos, vel, acc and mass are float arrays that hold the position, velocity, acceleration and mass for all the objects. Every frame those arrays are filled and passed to OpenCL using clCreateBuffer and clSetKernelArg. Once the above kernel code processes the objects the values are read back from the GPU using clEnqueueReadBuffer and then OpenGL (GL not CL) is used to display the objects to the screen using transparent billboard quads. Display is currently done in software OpenGL which is why the OpenGL display time is slower than the OpenCL calculations time.

The mingravdist check makes the calculations ignore objects that are too close together. If you allow objects very close together to interact they end up flinging objects out of the simulation space at a very high speed. For a world size of 300 units I use a mingravdist of 1.

__Latest Results__

Here is a new movie I just rendered with 3,000,000 particles. The frames took 4,700 ms to render on a GeForce GTX 570. Out of that time 844 ms was the OpenCL calculations. The rest of the time was rendering and passing the data back and forth to and from the GPU.

Unfortunately YouTube’s compression ruined the movie quality a bit. Mostly due to the noisy/static like nature of the millions of particles. Not enough I frames and too many P frames. Here are a series of screenshots showing what the frames looked like before the compression.

To push it further, here is a rotating disk of 5,000,000 objects. This one took 6,950 ms per frame with 1,420 ms for OpenCL calculations.

Again, here are some uncompressed screenshots.

The GTX 570 is on the lower end of the GeForce cards these days so the simulation speeds should be faster on the newer cards.

10,000,000 objects was also possible, but until I work out better ways of display it is not worth posting as most of the detail gets lost in the cloud of billboard particles. 10,000,000 was taking 8,780 ms per frame with 2,840 ms OpenCL time.

__Try It Yourself__

If use Windows you can download Visions of Chaos and see the simulations run yourself. Here are a few quick steps to get you going;

1. Open Visions of Chaos

2. Select Mode->Gravity->3D Gravity to change into the 3D Gravity mode

3. The 3D Gravity Settings dialog will appear

4. Change the number of objects to 1,000,000

5. Change the Time step to 0.02

6. Change the Size of objects to 0.2

7. Check Create AVI frames if you want to create an AVI movie from the simulation

8. Click OK

Make sure you have the latest video card drivers so the OpenCL code runs as optimal as possible. For NVidia go here and for AMD go here.

Jason.

Nice work! Compute shaders are terrific. I’ve done some work like this, and it looks like you’re not performing the gravitational interaction calculations between every possible pair, otherwise you’d have a hard time pushing this past 400,000 particles at 1fps. It seems like each particle is only being attracted to a small number of other particles. Also, instead of skipping close interactions, you can get rid of the singularity by changing the denominator in your force calculation from (distance^2) to (distance^2 + mingravdist^2). Then you don’t need the conditional. Finally, you can divide by mass and apply the velocity after all the accelerations have been added, not in the middle of the loop. Happy computing!

Hi Mark,

I have seen your very impressive fluids and gravity simulations before. Really nice work.

Removing the if only gets me a tiny speedup. Can you explain how to move the mass and velocity out of the loop?

Checking my code outside the OpenCL kernel does show a problem. You are correct and I am not comparing every particle to every other particle. Turns out I am only comparing every particle to approximately 100 others. That surprisingly still gives a reasonable and fast result. But wrong.

Thanks for your comment. I will have to revisit this ASAP and get a correct working simulation going.

Thanks,

Jason.

Thanks! Removing the conditional won’t cause much of a speedup initially, because all the particles are initially reasonably well-separated and most thread blocks won’t trigger the conditional, but these systems tend toward agglomeration, so there will eventually be lots of very close particles. it may not be much, but GPUs do an extra flop with more ease than a frequently-predicted conditional.

As for the mass and velocity, try this algorithm (I’m skipping some obvious steps):

acc[index] = 0

for int (a=0;…

dx=pos[a]…

distance=sqrt(…

dx=dx/distance

force=mass[a]/(distance*distance)

acc[index].x += dx*force

end for

acc[index].x *= forcescalefactor/massamount

vel[index].x += dt * acc[index].x

Saves a lot of math. It should be about 20 flops per interaction. This is, of course, for an Euler method, and Verlet would be an outstanding choice of integrator here.

Thanks again Mark.

Now I understand why people were so skeptical about my movies. Once I got the code comparing all particles performance plummets.

Jason.

It was only a problem of expectations—the images themselves are beautiful, regardless. An astrophysicist might think that it’s cheating, but it’s perfectly valid to have one set of masses perform the full O(N^2) interaction calculation, while another set of particles is essentially massless and is only moved around by that smaller first set. I did this in my Everything is Made of Atoms piece. https://www.youtube.com/watch?v=1yK2PdrKkmI There are maybe 2000 “active” (vortex) particles (with each influencing each other), and 32000 passive particles (which are only moved around by the active particles). The GPU only needs to do 2000*2000 + 32000*2000 interactions per frame. The trick is to have enough “active” particles that the whole simulation has the desired character.

Jason,

I made one error in the pseudo-code above. There’s no need to divide by the target particle’s mass at all, because we didn’t include it in the “force=” calculation. We’re computing acceleration, so the mass of the target body falls out of the equations.

Also note that with OpenGL 4.3, you can use shared memory on the GPU. Load in a cluster or tile of source masses into shared memory and do all the computation on them, then load another tile. Using that system, I’ve been able to get about 50% of theoretical peak performance on recent NVIDIA GPUs using the nbody06.lua scene from this package: https://bitbucket.org/jimbo00000/opengl-with-luajit

-Mark