OPTIMIZING A PARTICLE SYSTEM FOR MULTI-CORE ARCHITECTURE

This article was originally published in the "Inner Product" column in Game Developer Magazine, April 2006


In my February column I discussed the various ways you might extend a game engine to take advantage of multi-core processing. One method mentioned was "forking from a single thread." In this month's column I'm going to discuss how this is implemented in practice to speed up the processing of a particle system. I'll present the code needed to get multi-threaded processing working, and then benchmark various approaches and configurations.

PARTICLE SYSTEM

Particle systems are very common in games. They consist of a large number of small objects, each with a low memory and processing overhead, allowing you to get a very large number of objects on screen. Particle effects are used to simulate fire, smoke, explosions and various other physical effects.

In the example code (available online at gdmag.com) I implement a simple 2D particle system that nonetheless performs a significant amount of processing per particle, including a collision check to allow the particle to interact with the environment. The test case simulates 80,000 particles (see particle_screen.jpg).

My test machine has a 3.2 GHz dual-core Pentium 4 Extreme Edition with hyper-threading enabled. This configuration gives us four hardware threads. In theory, this should enable us to run code up to 2.6 times as fast as a regular single-core, non-hyper-threaded CPU. The figure of 2.6 is arrived at by taking a theoretical 100% speed increase from using two cores, followed by a 30% increase from hyper-threading (2 x 1.3 = 2.6). In real-world tests, I actually got a 3.0x speed increase, slightly better than expected.

THE CODE

I have attempted to make the code duplicate what you might find in a real game engine. There is a particle class, which has an Update() function. There is an array of particles, and a count of particles. To update the positions of the particles, the particle manager code simply iterates over them and calls their Update() function (Listing 1).

Listing 1:

for (int i=0;i<n;i++)
{
	mp_particles[i].Update(update_time);
}

The update code itself moves the particle, checks for collision with the environment, and updates the color of the particle based on its speed (Listing 2). This is a fairly significant amount of code to be executed per particle.

Listing 2:

void CParticle::Update(float time)
{
	m_old_pos = m_pos;
	Vector2 pos2 = m_pos + m_vel * time;	// candidate new position
	Vector2 to = m_pos - Vector2(x,y);	// offset from the force point at (x,y)
	float force = 1000.0f/to.Length2();	// inverse-square force
	m_vel += to * force;
	m_vel -= m_vel * 0.03f;	// damping
	m_color = 0xff800000 + (int)m_vel.Length();	// color the particle by speed
	if (whisker.CheckCollision(m_pos, pos2)) {
		// Push the particle out of the surface and reflect the velocity about the normal
		m_pos = whisker.GetPoint() + 0.01f * whisker.GetNormal();
		m_vel = m_vel - 2.0f * DotProduct(m_vel, whisker.GetNormal()) * whisker.GetNormal();
	} else {
		m_pos = pos2;
	}
}

This type of processing is an ideal candidate for parallel processing. The order in which the particles are processed is unimportant, since they do not interact with each other, and the particle processing does not use any common resources. Since processing steps through an array, the work can easily be broken into sub-sections of that array.

FORKING

So the first thing I tried was forking threads to process the particles in groups. To do this, the particle manager's update loop changes from Listing 1 to Listing 3.

Listing 3:

for (int t=0;t<MAX_THREAD;t++) {
  ThreadIndex[t] = t;
  thThread[t] = CreateThread(NULL,0,thParticleUpdate,(LPVOID)&ThreadIndex[t],0,&threadId);
}
WaitForMultipleObjects(MAX_THREAD,&thThread[0],true, INFINITE);
for (int t=0;t<MAX_THREAD;t++) {
  CloseHandle(thThread[t]);
}

This creates MAX_THREAD threads, which immediately begin executing. The main thread then waits at the call to WaitForMultipleObjects until all the threads are finished.

The value of MAX_THREAD should match the number of hardware threads your system is capable of supporting. In our case the optimal value is four, but we perform tests with one, two, three and four threads.

Each created thread calls the function thParticleUpdate, shown in Listing 4. This divides the particles up into chunks based on the thread number. With four threads, the first thread will process particles 0-19999, the second 20000-39999, the third 40000-59999 and the fourth 60000-79999.

Listing 4:

DWORD WINAPI thParticleUpdate(LPVOID p)
{
	int thread = *(int*)p;
	float time = update_time;
	int n = g_ParticleManager.m_num_particles;
	int step = n / MAX_THREAD;
	int start = step*thread;
	int end = step*(thread+1);
	if (thread == MAX_THREAD-1) end = n;	// last thread takes any remainder
	for (int i=start;i<end;i++) {
		g_ParticleManager.mp_particles[i].Update(time);
	}
	return 0;
}

WORKER THREADS

Since creating a thread incurs some overhead, it is best not to create threads every time we need to fork. Instead we can get functionally the same results by creating the threads upon program initialization, and then using events to signal each thread to start and to indicate when it has finished. This approach is known as using "worker threads". The particle manager update function then just reduces to signaling the thread start events, and then waiting for the stop events (Listing 5).

Listing 5:

for (int t=0;t<MAX_THREAD;t++)
	SetEvent(thStartThreading[t]);
WaitForMultipleObjects(MAX_THREAD,&thStopThreading[0],true, INFINITE);

The threads themselves are initialized at startup, along with the arrays of events used to signal a thread to start, and for the threads to signal that they have completed (Listing 6).

Listing 6:

for (int t=0;t<MAX_THREAD;t++){
    ThreadIndex[t] = t;
    thStartThreading[t] = CreateEvent(NULL, FALSE, FALSE, NULL);
    thStopThreading[t]  = CreateEvent(NULL, FALSE, FALSE, NULL);
    thThread[t] = CreateThread(NULL,0,thParticleUpdate,(LPVOID)&ThreadIndex[t],0,&threadId);
}

Then the function thParticleUpdate needs to be modified so that, instead of being a one-shot function, it is now a loop that executes the same piece of code whenever the corresponding thStartThreading event is signaled (Listing 7).

Listing 7:

DWORD WINAPI thParticleUpdate(LPVOID p)
{
	int thread = *(int*)p;
	while (1)
	{
		WaitForSingleObject(thStartThreading[thread],INFINITE);
		/* … BODY OF FUNCTION … */
		SetEvent(thStopThreading[thread]);
	}
}

MORE REALISM

In order for the tests to be more realistic, I also wanted to run them as if there were a physics engine taking up the majority of the processing time in a single thread. So I added another thread that just constantly multiplied two numbers together and stored the result (Listing 8).

Listing 8:

float dummy_a = 1.1f;
DWORD WINAPI DummyThread(LPVOID p)
{
	while (dummy_a != 0)
		dummy_a *= 1.00001f;
	return 0;
}

This was then run in a separate thread. It's not an ideal test, as it just constantly does work rather than doing a fixed amount of work. But it roughly simulates the case where you can keep a thread busy for an entire frame.

RESULTS

I tested three software configurations: forked threads, worker threads, and worker threads plus physics. Each configuration was tested with one, two, three and four threads, and the results were converted into a percentage of the time taken to update the particles in the normal manner (Listing 1). The code was timed both for how long it took to update the logic (columns L1, L2, etc.) and how long it took to execute a whole frame (F1, F2, etc.). Smaller numbers indicate less time taken, and hence faster execution.

After running the three software configurations, I tested three different hardware configurations. The test machine allowed me to disable both dual core and hyper-threading via the BIOS, so I tested all three possible combinations. The results are shown in Table 1.

ANALYSIS

As expected, the fastest execution is with four worker threads. The particle system is updated in one third of the time taken by the unthreaded approach. The frame rate is almost doubled, although in this test the particle system takes up two thirds of all CPU time.

The difference between forking threads and using worker threads is very striking, with the worker thread approach about 12% faster than using forked threads. This is presumably due to the overhead of creating a new thread: whenever a thread is created, memory must be allocated for it, including a default 1MB for the stack. This and other overhead makes constantly creating and destroying threads a drain on resources.

With the dummy physics thread in the mix, we see some slowdown, but not as much as you might expect, considering the dummy thread should be hogging 100% of one of the cores, leaving just the 30% hyper-threading capacity on that core plus the other processor. So why do we see only a slight drop in performance?

The answer is that the Windows scheduler attempts to spread the work evenly among the threads, which means it will constantly preempt the dummy thread in order to let the worker threads (and the main thread) execute. That's why a simple loop is not a good test load. A more realistic dummy physics load would be to execute the loop a fixed number of times, and have the main thread stall until the dummy thread finishes each frame.

FEWER CORES

Generally, the fewer cores you have, the lower your performance. One initially surprising result was that disabling hyper-threading actually improved the performance of the single-threaded version slightly, with the framerate increasing 4%. This shows that if code is not specifically written with hyper-threading in mind, then enabling hyper-threading can actually hurt performance slightly. Some PC vendors allow you to order your PC with hyper-threading disabled.

This is unfortunate, as the potential is there for hyper-threading to greatly improve performance, and it would be a shame if people switched it off just because some games ran slowly with it enabled. Indeed, with a single core and hyper-threading enabled, we get a very significant 24% improvement in framerate with two threads.

One interesting result is that whenever there are two logical CPUs (either a single core with hyper-threading, or dual cores with no hyper-threading), performance drops drastically when moving from two threads to three. This is because only two threads can now execute simultaneously. Since the execution time of a single worker thread is short, the task scheduler cannot share the work evenly, and the first two threads might run to completion before the third thread even starts. The third thread then runs with one CPU idle, giving a 25% decrease in performance. If the task were one that ran much longer, the overhead of having three threads for two CPUs would be much less. Running four threads is only slightly slower than two, as the work can again be evenly divided.

CONCLUSIONS

Optimizing simple systems for multi-core and hyperthreaded machines is very straightforward, and yields very impressive results. Using worker threads instead of forked threads means that there is very little overhead in splitting up a task, and performance increases can be had even on systems with a relatively small number of objects.

The Windows task scheduler usually does a good job of farming out the work to the available processors. While you can force a particular thread onto a particular CPU using the SetThreadAffinityMask function, I did not see any performance gain from doing so. It might still be worth looking into, though, depending on how your engine is structured.

Particle systems are just one of several potential things that could be optimized in this way. Anything that deals with a reasonable number of independent data objects is a candidate. Such things might include: skinning, debris physics, flocking or speech recognition. With appropriate synchronization methods, the multi-threaded approach can be extended to the updating of more complex interdependent objects, such as AI and game logic.

The best performance is gained by scaling your threads to match the available CPUs. On a fixed target like the Xbox 360, you can specifically target six hardware threads. On the PC, you can attempt to determine the number of CPUs available, but if in doubt, four threads will generally work well. However, you should consider that soon we will have quad-core hyper-threaded machines (such as Intel's "Tigerton"), giving us eight logical CPUs. Ideally your engine should scale to the appropriate number of threads, and not introduce overhead on a machine with a single logical CPU.

RESOURCES

James Boer, "Multithreaded Audio Programming Techniques," Game Programming Gems 5, Charles River Media, Hingham, MA, 2005, p. 697

Intel Corporation's multi-core articles, http://intel.com/software/multicore

AMD Corporation's multi-core site: http://multicore.amd.com/en/