This post did have a bunch of benchmarks and times that were incorrect and have been removed. I was not comparing every particle to every other particle in the gravity calculations. That was the cause of the impossibly fast speeds I had listed.

See here for the latest kernel code and explanations.

What is left of this post is still correct. The object counts are correct and there are millions of particles being shown. It is just the actual gravity calculations that are not complete.

Previous Results

Since I originally posted the following 3D gravity movies to YouTube…

…there were questions and some skepticism in the comments so hopefully this blog post helps clarify things. That skepticism about how I could possibly calculate these millions of objects at once was well founded.

Latest Results

Here is a movie with 3,000,000 particles.

Unfortunately YouTube’s compression ruined the movie quality a bit. Mostly due to the noisy/static like nature of the millions of particles. Not enough I frames and too many P frames. Here are a series of screenshots showing what the frames looked like before the compression.

To push it further, here is a rotating disk of 5,000,000 objects.

Again, here are some uncompressed screenshots.

Try It Yourself

If use Windows you can download Visions of Chaos and see the simulations run yourself. Here are a few quick steps to get you going;

1. Open Visions of Chaos
2. Select Mode->Gravity->3D Gravity to change into the 3D Gravity mode
3. The 3D Gravity Settings dialog will appear
4. Change the number of objects to 1,000,000
5. Change the Time step to 0.02
6. Change the Size of objects to 0.2
7. Check Create AVI frames if you want to create an AVI movie from the simulation
8. Click OK

Make sure you have the latest video card drivers so the OpenCL code runs as optimal as possible. For NVidia go here and for AMD go here.

Nice work! Compute shaders are terrific. I’ve done some work like this, and it looks like you’re not performing the gravitational interaction calculations between every possible pair, otherwise you’d have a hard time pushing this past 400,000 particles at 1fps. It seems like each particle is only being attracted to a small number of other particles. Also, instead of skipping close interactions, you can get rid of the singularity by changing the denominator in your force calculation from (distance^2) to (distance^2 + mingravdist^2). Then you don’t need the conditional. Finally, you can divide by mass and apply the velocity after all the accelerations have been added, not in the middle of the loop. Happy computing!

I have seen your very impressive fluids and gravity simulations before. Really nice work.

Removing the if only gets me a tiny speedup. Can you explain how to move the mass and velocity out of the loop?

Checking my code outside the OpenCL kernel does show a problem. You are correct and I am not comparing every particle to every other particle. Turns out I am only comparing every particle to approximately 100 others. That surprisingly still gives a reasonable and fast result. But wrong.

Thanks for your comment. I will have to revisit this ASAP and get a correct working simulation going.

Thanks! Removing the conditional won’t cause much of a speedup initially, because all the particles are initially reasonably well-separated and most thread blocks won’t trigger the conditional, but these systems tend toward agglomeration, so there will eventually be lots of very close particles. it may not be much, but GPUs do an extra flop with more ease than a frequently-predicted conditional.

As for the mass and velocity, try this algorithm (I’m skipping some obvious steps):

acc[index] = 0
for int (a=0;…
dx=pos[a]…
distance=sqrt(…
dx=dx/distance
force=mass[a]/(distance*distance)
acc[index].x += dx*force
end for
acc[index].x *= forcescalefactor/massamount
vel[index].x += dt * acc[index].x

Saves a lot of math. It should be about 20 flops per interaction. This is, of course, for an Euler method, and Verlet would be an outstanding choice of integrator here.

It was only a problem of expectations—the images themselves are beautiful, regardless. An astrophysicist might think that it’s cheating, but it’s perfectly valid to have one set of masses perform the full O(N^2) interaction calculation, while another set of particles is essentially massless and is only moved around by that smaller first set. I did this in my Everything is Made of Atoms piece. https://www.youtube.com/watch?v=1yK2PdrKkmI There are maybe 2000 “active” (vortex) particles (with each influencing each other), and 32000 passive particles (which are only moved around by the active particles). The GPU only needs to do 2000*2000 + 32000*2000 interactions per frame. The trick is to have enough “active” particles that the whole simulation has the desired character.

Jason,
I made one error in the pseudo-code above. There’s no need to divide by the target particle’s mass at all, because we didn’t include it in the “force=” calculation. We’re computing acceleration, so the mass of the target body falls out of the equations.
Also note that with OpenGL 4.3, you can use shared memory on the GPU. Load in a cluster or tile of source masses into shared memory and do all the computation on them, then load another tile. Using that system, I’ve been able to get about 50% of theoretical peak performance on recent NVIDIA GPUs using the nbody06.lua scene from this package: https://bitbucket.org/jimbo00000/opengl-with-luajit
-Mark

Nice work! Compute shaders are terrific. I’ve done some work like this, and it looks like you’re not performing the gravitational interaction calculations between every possible pair, otherwise you’d have a hard time pushing this past 400,000 particles at 1fps. It seems like each particle is only being attracted to a small number of other particles. Also, instead of skipping close interactions, you can get rid of the singularity by changing the denominator in your force calculation from (distance^2) to (distance^2 + mingravdist^2). Then you don’t need the conditional. Finally, you can divide by mass and apply the velocity after all the accelerations have been added, not in the middle of the loop. Happy computing!

Hi Mark,

I have seen your very impressive fluids and gravity simulations before. Really nice work.

Removing the if only gets me a tiny speedup. Can you explain how to move the mass and velocity out of the loop?

Checking my code outside the OpenCL kernel does show a problem. You are correct and I am not comparing every particle to every other particle. Turns out I am only comparing every particle to approximately 100 others. That surprisingly still gives a reasonable and fast result. But wrong.

Thanks for your comment. I will have to revisit this ASAP and get a correct working simulation going.

Thanks,

Jason.

Thanks! Removing the conditional won’t cause much of a speedup initially, because all the particles are initially reasonably well-separated and most thread blocks won’t trigger the conditional, but these systems tend toward agglomeration, so there will eventually be lots of very close particles. it may not be much, but GPUs do an extra flop with more ease than a frequently-predicted conditional.

As for the mass and velocity, try this algorithm (I’m skipping some obvious steps):

acc[index] = 0

for int (a=0;…

dx=pos[a]…

distance=sqrt(…

dx=dx/distance

force=mass[a]/(distance*distance)

acc[index].x += dx*force

end for

acc[index].x *= forcescalefactor/massamount

vel[index].x += dt * acc[index].x

Saves a lot of math. It should be about 20 flops per interaction. This is, of course, for an Euler method, and Verlet would be an outstanding choice of integrator here.

Thanks again Mark.

Now I understand why people were so skeptical about my movies. Once I got the code comparing all particles performance plummets.

Jason.

It was only a problem of expectations—the images themselves are beautiful, regardless. An astrophysicist might think that it’s cheating, but it’s perfectly valid to have one set of masses perform the full O(N^2) interaction calculation, while another set of particles is essentially massless and is only moved around by that smaller first set. I did this in my Everything is Made of Atoms piece. https://www.youtube.com/watch?v=1yK2PdrKkmI There are maybe 2000 “active” (vortex) particles (with each influencing each other), and 32000 passive particles (which are only moved around by the active particles). The GPU only needs to do 2000*2000 + 32000*2000 interactions per frame. The trick is to have enough “active” particles that the whole simulation has the desired character.

Jason,

I made one error in the pseudo-code above. There’s no need to divide by the target particle’s mass at all, because we didn’t include it in the “force=” calculation. We’re computing acceleration, so the mass of the target body falls out of the equations.

Also note that with OpenGL 4.3, you can use shared memory on the GPU. Load in a cluster or tile of source masses into shared memory and do all the computation on them, then load another tile. Using that system, I’ve been able to get about 50% of theoretical peak performance on recent NVIDIA GPUs using the nbody06.lua scene from this package: https://bitbucket.org/jimbo00000/opengl-with-luajit

-Mark