Cowboy Programming: Game Development and General Hacking by the Old West

January 5, 2007

Multithreading Particle Systems

Filed under: Game Development,Inner Product — Mick West @ 12:16 pm

OPTIMIZING A PARTICLE SYSTEM FOR MULTI-CORE ARCHITECTURE

This article was originally published in the “Inner Product” column in Game Developer Magazine, April 2006

Download the sample application (229K).

In my February column I discussed the various ways you might extend a game engine to take advantage of multi-core processing. One method mentioned was “forking from a single thread.” In this month’s column I discuss how this is implemented in practice to speed up the processing of a particle system. I’ll present the code needed to get multi-threaded processing working, and then benchmark various approaches and configurations.

PARTICLE SYSTEM

Particle systems are very common in games. They consist of a large number of small objects, each with a low memory and processing overhead, allowing you to get a very large number of objects on screen. Particle effects are used to simulate fire, smoke, explosions and various other physical effects.

In the example code (available online at gdmag.com) I implement a simple 2D particle system that nonetheless performs a significant amount of processing per particle, including a collision check to allow the particle to interact with the environment. The test case simulates 80,000 particles.

My test machine has a 3.2 GHz dual core Pentium 4 Extreme Edition with hyper-threading enabled. This configuration gives us four hardware threads. In theory this should enable us to run code up to 2.6 times as fast as a regular single core, non-hyper-threaded CPU. The figure of 2.6 is arrived at by assuming a theoretical 100% speed increase from using two cores, followed by a 30% increase from hyper-threading. In real-world tests, I actually got a 3.0x speed increase, slightly better than expected.

THE CODE

I have attempted to make the code duplicate what you might find in a real game engine. There is a particle class with an Update() function, an array of particles, and a count of particles. To update the positions of the particles, the particle manager simply iterates over the array and calls each particle’s Update() function (Listing 1).

Listing 1:

for (int i = 0; i < n; i++) {
	mp_particles[i].Update(update_time);
}

The update code itself moves the particle, checks for collision with the environment, and updates the color of the particle based on its speed (Listing 2). This is a fairly significant amount of code to be executed per particle.

Listing 2:

void CParticle::Update(float time)
{
	m_old_pos = m_pos;
	// Tentative new position from simple Euler integration.
	Vector2 pos2 = m_pos + m_vel * time;
	// Accelerate away from the point (x, y) (a global in the sample app),
	// with strength falling off with distance squared.
	Vector2 to = m_pos - Vector2(x, y);
	float force = 1000.0f / to.Length2();
	m_vel += to * force;
	// Apply simple drag.
	m_vel -= m_vel * 0.03f;
	// Color the particle according to its speed.
	m_color = 0xff800000 + (int)m_vel.Length();
	// 'whisker' is a global collision probe in the sample app; test the
	// path from the current to the tentative position against the environment.
	if (whisker.CheckCollision(m_pos, pos2)) {
		// Move to just off the collision point and reflect the velocity
		// about the surface normal.
		m_pos = whisker.GetPoint() + 0.01f * whisker.GetNormal();
		m_vel = m_vel - 2.0f * DotProduct(m_vel, whisker.GetNormal()) * whisker.GetNormal();
	} else {
		m_pos = pos2;
	}
}

This type of processing is an ideal candidate for parallel processing. The order in which the particles are processed is unimportant, since they do not interact with each other, and the particle processing does not use any common resources. Since the processing steps through an array, the work can easily be broken into sub-sections of that array.

FORKING

So the first thing I tried was forking threads to process the particles in groups. To do this, the particle manager’s update loop changed from Listing 1 to Listing 3.

Listing 3:

for (int t=0;t<MAX_THREAD;t++) {
  ThreadIndex[t] = t;
  thThread[t] = CreateThread(NULL,0,thParticleUpdate,(LPVOID)&ThreadIndex[t],0,&threadId);
}
WaitForMultipleObjects(MAX_THREAD,&thThread[0],true, INFINITE);
for (int t=0;t<MAX_THREAD;t++) {
  CloseHandle(thThread[t]);
}

This creates MAX_THREAD threads, which immediately begin executing. The main thread then waits at the call to WaitForMultipleObjects until all the threads are finished.

The value of MAX_THREAD should match the number of hardware threads your system is capable of supporting. In our case the optimal value is four, but we perform tests with one, two, three and four threads.

Each created thread calls the function thParticleUpdate, shown in Listing 4. This divides the particles up into chunks based on the thread number. With four threads, the first thread will process particles 0-19999, the second 20000-39999, the third 40000-59999 and the fourth 60000-79999.

Listing 4:

DWORD WINAPI thParticleUpdate(LPVOID p)
{
	int thread = *(int*)p;
	float time = update_time;
	int n = g_ParticleManager.m_num_particles;
	// Divide the particle array into MAX_THREAD contiguous chunks.
	int step = n / MAX_THREAD;
	int start = step * thread;
	int end = step * (thread + 1);
	// The last thread also picks up any remainder.
	if (thread == MAX_THREAD - 1) end = n;
	for (int i = start; i < end; i++) {
		g_ParticleManager.mp_particles[i].Update(time);
	}
	return 0;
}

WORKER THREADS

Since creating a thread incurs some overhead, it is best not to create threads every time we need to fork. Instead, we can get functionally the same results by creating the threads at program initialization, and then using events to signal each thread to start and to indicate when it has finished. This approach is known as using “worker threads.” The particle manager update function then reduces to signaling the thread start events, and waiting for the stop events (Listing 5).

Listing 5

for (int t=0;t<MAX_THREAD;t++)
	SetEvent(thStartThreading[t]);
WaitForMultipleObjects(MAX_THREAD,&thStopThreading[0],true, INFINITE);

The threads themselves are initialized at startup, along with the arrays of events used to signal each thread to start, and for each thread to signal that it has completed (Listing 6).

Listing 6

for (int t=0;t<MAX_THREAD;t++){
    ThreadIndex[t] = t;
    thStartThreading[t] = CreateEvent(NULL, FALSE, FALSE, NULL);
    thStopThreading[t]  = CreateEvent(NULL, FALSE, FALSE, NULL);
    thThread[t] = CreateThread(NULL,0,thParticleUpdate,(LPVOID)&ThreadIndex[t],0,&threadId);
}

The function thParticleUpdate then needs to be modified so that instead of being a one-shot function, it is a loop that executes the same body of code whenever the thStartThreading event is signaled (Listing 7).

Listing 7:

DWORD WINAPI thParticleUpdate(LPVOID p)
{
	int thread = *(int*)p;
	while (1)
	{
		WaitForSingleObject(thStartThreading[thread],INFINITE);
		/* … BODY OF FUNCTION … */
		SetEvent(thStopThreading[thread]);
	}
}

MORE REALISM

In order to make the tests more realistic, I also wanted to run them as if a physics engine were taking up the majority of the processing time in a single thread. So I added another thread that just constantly multiplied two numbers together and stored the result (Listing 8).

Listing 8:

float dummy_a = 1.1f;
DWORD WINAPI DummyThread(LPVOID p)
{
	while (dummy_a != 0)
		dummy_a *= 1.00001f;
	return 0;
}

This was then run in a separate thread. It’s not an ideal test, as it constantly does work rather than performing a fixed amount of work, but it roughly simulates the case where you can keep a thread busy for an entire frame.

RESULTS

I tested three software configurations: forked threads, worker threads, and worker threads plus physics. Each configuration was tested with one, two, three and four threads, and the results were converted into a percentage of the time taken to update the particles in the normal manner (Listing 1). The code was timed both for how long it took to update the logic (columns L1, L2, etc.), and how long it took to execute a whole frame (F1, F2, etc.). Smaller numbers indicate less time taken, and hence faster code.

After running the three software configurations, I tested three different hardware configurations. The test machine allowed me to disable both dual core and hyper-threading via the BIOS, so I tested all three possible combinations. The results are shown in Table 1.

ANALYSIS

As expected, the fastest execution is with four worker threads. The particle system is updated in one third of the time taken by the unthreaded approach. The frame rate is almost doubled, although in this test the particle system takes up two thirds of all CPU time.

The difference between forking threads and using worker threads is very striking, with the worker thread approach about 12% faster. This is presumably due to the overhead of creating a new thread. Whenever a thread is created, memory must be allocated, including a default 1MB for the stack. This and other overhead makes constantly creating and destroying threads a drain on resources.

With the dummy physics thread in the mix, we see some slowdown, but not as much as you might expect, considering the dummy thread should be hogging 100% of one of the cores, leaving just the 30% hyper-thread capacity of that core, plus the other core. So why do we see only a slight drop in performance?

The answer is that the Windows scheduler attempts to spread the work evenly amongst the threads, which means it will constantly preempt the dummy thread in order to let the worker threads (and the main thread) execute. That’s why a simple loop is not a good test load. A more realistic dummy physics load would be to execute the loop a fixed number of times, and have the main thread stall until the dummy thread finishes each frame.
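
For reference, a fixed-work version of the dummy load might look something like the sketch below. The thStartPhysics and thPhysicsDone events are hypothetical, assumed to be auto-reset events created the same way as thStartThreading; the iteration count is an arbitrary stand-in for a real physics workload.

HANDLE thStartPhysics;   // assumed created with CreateEvent, as in Listing 6
HANDLE thPhysicsDone;
float  g_dummy = 1.1f;

DWORD WINAPI FixedDummyThread(LPVOID p)
{
	while (1)
	{
		WaitForSingleObject(thStartPhysics, INFINITE); // signaled once per frame
		// A fixed amount of work per frame, rather than an endless loop.
		for (int i = 0; i < 2000000; i++)
			g_dummy *= 1.00001f;
		SetEvent(thPhysicsDone); // main thread waits on this before ending the frame
	}
}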

FEWER CORES

Generally, the fewer cores you have, the lower your performance. One initially surprising result was that disabling hyper-threading actually improved the performance of the single-threaded version slightly, with the framerate increasing 4%. This shows that if code is not specifically written with hyper-threading in mind, then enabling hyper-threading can actually hurt performance slightly. Some PC vendors let you order your PC with hyper-threading disabled.

This is unfortunate, as the potential is there for hyper-threading to greatly improve performance, and it would be a shame if people switched it off just because some games run slowly with it enabled. Indeed, with a single core and hyper-threading enabled, we get a very significant 24% improvement in framerate with two threads.

One interesting result is that whenever there are two logical CPUs (either single core with hyper-threading, or dual core without it), performance drops drastically when moving from two threads to three. This is because we can now only execute two threads simultaneously. Since the execution time of a single worker thread is short, the task scheduler cannot share the work evenly, so the first two threads might run to completion before the third thread even starts; the third thread then runs with one CPU idle, giving a 25% decrease in performance. If the task were one that ran much longer, the overhead of having three threads on two CPUs would be much less. Running four threads is only slightly slower than two, as the work can again be divided evenly.

CONCLUSIONS

Optimizing simple systems for multi-core and hyperthreaded machines is very straightforward, and yields very impressive results. Using worker threads instead of forked threads means that there is very little overhead in splitting up a task, and performance increases can be had even on systems with a relatively small number of objects.

The Windows task scheduler usually does a good job of farming out the work to available processors. While you can force a particular thread onto a particular CPU using the SetThreadAffinityMask function, I did not see any performance gain from doing this. It might still be worth looking into, depending on how your engine is structured.
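
If you do want to experiment with affinity, pinning each worker thread to its own logical CPU is a one-line call per thread; a minimal sketch (which, as noted, gave me no measurable gain):

// Pin worker thread t to logical processor t. The mask has one bit per
// logical CPU; this showed no performance benefit in my tests.
for (int t = 0; t < MAX_THREAD; t++) {
	SetThreadAffinityMask(thThread[t], DWORD_PTR(1) << t);
}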

Particle systems are just one of several potential things that could be optimized in this way. Anything that deals with a reasonable number of independent data objects is a candidate. Such things might include: skinning, debris physics, flocking or speech recognition. With appropriate synchronization methods, the multi-threaded approach can be extended to the updating of more complex interdependent objects, such as AI and game logic.

The best performance is gained by scaling your threads to match the available CPUs. On a fixed target like the Xbox 360, you can specifically target six hardware threads. On the PC, you can attempt to determine the number of CPUs available; if in doubt, four threads will generally work well. However, you should consider that we will soon have quad-core hyper-threaded machines (such as Intel’s “Tigerton”), giving us eight logical CPUs. Ideally your engine should scale to the appropriate number of threads, and not introduce overhead on a machine with a single logical CPU.
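
On Windows, one simple way to size the thread pool at run time is GetSystemInfo; a minimal sketch:

// dwNumberOfProcessors counts logical CPUs, so it includes hyper-threads.
SYSTEM_INFO si;
GetSystemInfo(&si);
int num_threads = si.dwNumberOfProcessors;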

RESOURCES

James Boer, “Multithreaded Audio Programming Techniques,” Game Programming Gems 5, Charles River Media, Hingham, MA, 2005, p. 697

Intel Corporation’s multi-core articles, http://intel.com/software/multicore

AMD Corporation’s multi-core site: http://multicore.amd.com/en/

Evolve Your Hierarchy

Filed under: Game Development,Inner Product — Mick West @ 11:35 am

Refactoring Game Entities with Components

Until fairly recently, game programmers have consistently used a deep class hierarchy to represent game entities. The tide is beginning to shift from these deep hierarchies to a variety of methods that compose a game entity object as an aggregation of components. This article explains what this means and explores some of the benefits and practical considerations of such an approach. I will also describe my personal experience of implementing this system on a large code base, including how to sell the idea to other programmers and management.

GAME ENTITIES

Different games have different requirements for what a game entity needs to be, but in most games the concept is quite similar. A game entity is some object that exists in the game world; usually it is visible to the player, and usually it can move around.

Some example entities:

  • Missile
  • Car
  • Tank
  • Grenade
  • Gun
  • Hero
  • Pedestrian
  • Alien
  • Jetpack
  • Med-kit
  • Rock

Entities can usually do various things. Here are some of the things you might want entities to do:

  • Run a script
  • Move
  • React as a rigid body
  • Emit Particles
  • Play located audio
  • Be picked up by the player
  • Be worn by the player
  • Explode
  • React to magnets
  • Be targeted by the player
  • Follow a path
  • Animate

TRADITIONAL DEEP HIERARCHIES


The traditional way of representing a set of game entities like this is to perform an object-oriented decomposition of the set of entities we want to represent. This usually starts out with good intentions, but is frequently modified as the game development progresses – particularly if a game engine is re-used for a different game. We usually end up with something like figure 1, but with a far greater number of nodes in the class hierarchy.

As development progresses, we usually need to add various points of functionality to the entities. The objects must either encapsulate the functionality themselves, or be derived from an object that includes that functionality. Often, the functionality is added to the class hierarchy at some level near the root, such as the CEntity class. This has the benefit of the functionality being available to all derived classes, but has the downside of the associated overhead also being carried by those classes.

Even fairly simple objects such as rocks or grenades can end up with a large amount of additional functionality (and associated member variables, and possibly unnecessary execution of member functions). Often, the traditional game object hierarchy ends up creating the type of object known as “the blob”. The blob is a classic “anti-pattern” which manifests as a huge single class (or a specific branch of a class hierarchy) with a large amount of complex interwoven functionality.

While the blob anti-pattern often shows up near the root of the object hierarchy, it will also show up in leaf nodes. The most likely candidate for this is the class representing the player character. Since the game is usually programmed around a single character, then the object representing that character often has a very large amount of functionality. Frequently this is implemented as a large number of member functions in a class such as CPlayer.

The result of implementing functionality near the root of the hierarchy is an overburdening of the leaf objects with unneeded functionality. However, the opposite approach of implementing the functionality in the leaf nodes can also have unfortunate consequences. Functionality becomes compartmentalized, so that only the objects specifically programmed for that particular functionality can use it. Programmers often duplicate code to mirror functionality already implemented in a different object. Eventually messy re-factoring is required, re-structuring the class hierarchy to move and combine functionality.

Take for example the functionality of having an object react under physics as a rigid body. Not every object needs to be able to do this. As you can see in figure 1, we just have the CRock and the CGrenade classes derived from CRigid. What happens when we want to apply this functionality to the vehicles? You have to move the CRigid class further up the hierarchy, making it more and more like the root-heavy blob pattern we saw before, with all the functionality bunched in a narrow chain of classes from which most other entity classes are derived.

AN AGGREGATION OF COMPONENTS

The component approach, which is gaining more acceptance in current game development, is one of separating the functionality into individual components that are mostly independent of one another. The traditional object hierarchy is dispensed with, and an object is now created as an aggregation (a collection) of independent components.

Each object now only has the functionality that it needs. Any distinct new functionality is implemented by adding a component.

A system of forming an object from aggregating components can be implemented in one of three ways, which may be viewed as separate stages in moving from a blob object hierarchy to a composite object.

OBJECT AS ORGANIZED BLOB

A common way of re-factoring a blob object is to break out the functionality of that object into sub-objects, which are then referenced by the first object. Eventually the parent blob object can mostly be replaced by a series of pointers to other objects, and the blob object’s member functions become interface functions for the functions of those sub-objects.

This may actually be a reasonable solution if the amount of functionality in your game objects is reasonably small, or if time is limited. You can implement arbitrary object aggregation simply by allowing some of the sub-objects to be absent (by having a NULL pointer to them). Assuming there are not too many sub-objects, then this still allows you the advantage of having lightweight pseudo-composite objects without having to implement a framework for managing the components of that object.

The downside is that this is still essentially a blob. All the functionality is still encapsulated within one large object. It is unlikely you will fully factor the blob into sub-objects, so you will still be left with some significant overhead, which will weigh down your lightweight objects. You also still have the overhead of constantly checking all the NULL pointers to see if they need updating.

OBJECT AS COMPONENT CONTAINER

The next stage is to factor each of the components (the “sub-objects” in the previous example) into objects that share a common base class, so we can store a list of components inside an object.
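
As a rough sketch of what that looks like (the names here are illustrative, not from any particular code base), the common base class and the container object might be:

#include <vector>

class CComponent
{
public:
	virtual ~CComponent() {}
	virtual void Update(float time) = 0; // each component updates itself
};

class CGameObject
{
public:
	void AddComponent(CComponent* component) { m_components.push_back(component); }
	void Update(float time)
	{
		// The object is just a container; all behavior lives in the components.
		for (size_t i = 0; i < m_components.size(); i++)
			m_components[i]->Update(time);
	}
private:
	std::vector<CComponent*> m_components;
};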

This is an intermediate solution, as we still have the root “object” that represents the game entity. However, it may be a reasonable solution, or indeed the only practical solution, if a large part of the code base requires this notion of a game object as a concrete object.

Your game object then becomes an interface object that acts as a bridge between the legacy code in your game, and the new system of composite objects. As time permits, you will eventually remove the notion of game entity as being a monolithic object, and instead address the object more and more directly via its components. Eventually you may be able to transition to a pure aggregation.

OBJECT AS A PURE AGGREGATION

In this final arrangement, an object is simply the sum of its parts. Figure 2 shows a scheme where each game entity is composed of a collection of components. There is no “game entity object” as such. Each column in the diagram represents a list of identical components; each row can be thought of as representing an object. The components themselves can be treated as being independent of the objects they make up.

PRACTICAL EXPERIENCE

I first implemented a system of object composition from components when working at Neversoft, on the Tony Hawk series of games. Our game object system had developed over the course of three successive games until we had a game object hierarchy that resembled the blob anti-pattern I described earlier. It suffered from all the same problems: the objects tended to be heavyweight. Objects had unnecessary data and functionality. Sometimes the unnecessary functionality slowed down the game. Functionality was sometimes duplicated in different branches of the tree.

I had heard about this new-fangled “component based objects” system on the sweng-gamedev mailing list, and decided it sounded like a very good idea. I set about re-organizing the code-base, and two years later, it was done.

Why so long? Well, firstly, we were churning out Tony Hawk games at the rate of one per year, so there was little time between games to devote to re-factoring. Secondly, I miscalculated the scale of the problem. A three-year-old code-base contains a lot of code, and a lot of that code had become somewhat inflexible over the years. Since the code relied on the game objects being game objects, and very particular game objects at that, it proved to be a lot of work to make everything work as components.

EXPECT RESISTANCE

The first problem I encountered was in trying to explain the system to other programmers. If you are not particularly familiar with the idea of object composition and aggregation, then it can strike you as pointless, needlessly complex, and unnecessary extra work. Programmers who have worked with the traditional system of object hierarchies for many years become very used to working that way. They even become very good at working that way, and manage to work around the problems as they arise.

Selling the idea to management is also difficult. You need to be able to explain in plain words exactly how this is going to help get the game done faster. Something along the lines of:

“Whenever we add new stuff to the game now, it takes a long time to do, and there are lots of bugs. If we do this new component object thing, it will let us add new stuff a lot quicker, and have fewer bugs.”

My approach was to introduce it in a stealth manner. I first discussed the idea with a couple of programmers individually, and eventually convinced them it was a good idea. I then implemented the basic framework for generic components, and implemented one small aspect of game object functionality as a component.

I then presented this to the rest of the programmers. There was some confusion and resistance, but since it was implemented and working there was not much argument.

SLOW PROGRESS

Once the framework was established, the conversion from static hierarchy to object composition happened slowly. It is thankless work, since you spend hours and days re-factoring code into something that seems functionally no different to the code it replaces. In addition, we were doing this while still implementing new features for the next iteration of the game.

At an early point, we hit the problem of re-factoring our largest class, the skater class. Since it contained a vast amount of functionality, it was almost impossible to re-factor a piece at a time. In addition, it could not really be re-factored until the other object systems in the game conformed to the component way of doing things. These in turn could not be cleanly refactored as components unless the skater was also a component.

The solution here was to create a “blob component.” This was a single huge component, which encapsulated much of the functionality of the skater class. A few other blob components were required in other places, and we eventually shoehorned the entire object system into a collection of components. Once this was in place, the blob components could gradually be refactored into more atomic components.

RESULTS

The first results of this re-factoring were barely tangible. But over time the code became cleaner and easier to maintain as functionality was encapsulated in discrete components. Programmers began to create new types of object in less time, simply by combining a few components and adding a new one.

We created a system of data-driven object creation, so that entirely new types of object could be created by the designers. This proved invaluable in the speedy creation and configuration of new types of objects.

Eventually the programmers came (at different rates) to embrace the component system, and became very adept at adding new functionality via components. The common interface and the strict encapsulation led to a reduction in bugs, and code that was easier to read, maintain and re-use.

IMPLEMENTATION DETAILS

Giving each component a common interface means deriving from a base class with virtual functions. This introduces some additional overhead. Do not let this turn you against the idea, as the additional overhead is small, compared to the savings due to simplification of objects.

Since each component has a common interface, it is very easy to add additional debug member functions to each component. That made it a relatively simple matter to add an object inspector that could dump the contents of the components of a composite object in a human readable manner. Later this would evolve into a sophisticated remote debugging tool that was always up to date with all possible types of game object. This is something that would have been very tiresome to implement and maintain with the traditional hierarchy.

Ideally, components should not know about each other. In practice, however, there will always be dependencies between specific components, and performance issues also dictate that components be able to quickly access one another. Initially we had all component references go through the component manager; however, when this started using up over 5% of our CPU time, we allowed the components to store pointers to one another and call member functions in other components directly.
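
A hypothetical sketch of that direct-pointer arrangement: the pointer is wired up once when the entity is assembled, instead of querying the component manager every frame.

class CAnimComponent; // defined elsewhere

class CPhysicsComponent
{
public:
	// Resolved once at creation; avoids the per-frame manager lookup that
	// was costing us over 5% of CPU time.
	void SetSibling(CAnimComponent* anim) { mp_anim = anim; }
private:
	CAnimComponent* mp_anim;
};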

The order of composition of the components in an object can be important. In our initial system, we stored the components as a list inside a container object. Each component had an update function, which was called as we iterated over the list of components for each object.

Since object creation was data driven, problems could arise if the list of components was in an unexpected order. If one object updates physics before animation, and another updates animation before physics, then they might get out of sync with each other. Dependencies such as this need to be identified, and then enforced in code.

CONCLUSIONS

Moving from blob style object hierarchies to composite objects made from a collection of components was one of the best decisions I made. The initial results were disappointing as it took a long time to re-factor existing code. However, the end results were well worth it, with lightweight, flexible, robust and re-usable code.

Resources

Scott Bilas: GDC 2002 Presentation: A Data-Driven Game Object System
http://scottbilas.com/files/2002/gdc_san_jose/game_objects_slides.pdf

Bjarne Rene: Component Based Object Management. Game Programming Gems 5, 2005, page 25.

Kyle Wilson: Game Object Structure: Inheritance vs. Aggregation, 2002, http://gamearchitect.net/Articles/GameObjects1.html

Multi-core Processors

Filed under: Game Development,Inner Product — Mick West @ 10:19 am

This article originally appeared in the “Inner Product” column in Game Developer Magazine, February 2006

Next generation game platforms are increasingly making use of multiple processors to deliver more computing power. The Xbox 360 has three symmetrical CPUs, each running two hardware threads. The PS3 has one master CPU with seven separate “SPE” processors. The Nintendo Revolution is probably dual threaded. Newer gaming PCs have dual core hyper-threaded processors.

Programmers who are used to programming for a single core are now increasingly faced with the challenge of translating their skills to a multi-core system. Ideally you want to fully utilize the power of all the processors available to you; a processor idling is a processor wasted. In this article I describe some of the issues involved, and discuss a few potential ways developers might architect their engines to more fully utilize the potential processing power.

DON’T DO IT?


The simplest option is to just ignore the additional CPU resources, and run all your code from a single thread on a single CPU. (See Figure 1).

All the examples will use this simplified version of the systems in a game engine. Note that the “rendering” portion here is the CPU contribution to rendering (things like high level visibility determination, and preparing the data stream for the GPU to process). The GPU processing is not shown.

Here the flow of execution is very straightforward. We have the game physics, followed by the game logic, and finally the CPU contribution to rendering. Each of these very high-level tasks also has a strict order of mid-level tasks. The physics, for example, first processes the rigid bodies, then the IK animation, and finally the particle systems. This set of systems here is roughly what you might find in a current single processor game.

While this option is obviously not going to take advantage of any of the additional processing power, it does greatly simplify the task of programming the game. Whether this is the best choice depends very much on the type of game you are programming, your schedule, and the programming resources available to you.

Remember the GPU is still responsible for a large portion of the visual quality of your game. If the game contains a relatively low amount of moving and interacting objects (such as a golf game), then you may well be able to get away with using a single processor.

The decision to go single threaded can also greatly simplify the debugging process, as multi-threaded programs can be very difficult to debug. If your schedule is short, and you’ve already got a single-threaded code-base up and running, then it may well be cost effective to simply ignore the extra CPUs, at least for this version of the game.

MANUALLY THREADED TASKS


Some mid level tasks are sufficiently decoupled that they can be executed in any order, and even in parallel. If you are able to identify two such tasks, then in theory you should be able to save the processing time that would be taken up by the shorter of the two tasks, by running the two tasks in parallel. (Figure 2)

Here we see the second core has been partially utilized. The rigid body physics and the IK animation are now running concurrently. The time saved here is again the shorter of the two tasks (the IK animation).

The logic portion of the pipeline is not threaded. Typically things such as AI and game object logic are very dependent on the order of execution, and so it is difficult to split an existing AI system into separate threads without introducing a large number of interesting bugs.

Rendering also has some parallelism shown, with the object rendering running at the same time as the environment rendering.

This approach of selectively shifting certain large tasks into separate threads (and onto separate CPU cores) is something you might do as a quick and easy step when retro-fitting an existing engine to take advantage of multiple cores. The implementation will vary greatly according to the dependencies between systems in your engine. For example, it might be impossible to perform environment rendering at the same time as object rendering if they both share some resource, such as a pool of vertex buffers, that is not thread safe.

The disadvantage is that you have a fairly limited number of pieces of processing with which to play. There are very few ways in which you can arrange things so that your engine will still work, and hence you will end up making poor use of the second CPU core.

Since the parallel tasks are doing very different things, you will also not be making the best use of your shared cache on a hyper-threaded system.

FORKING FROM A SINGLE THREAD

If your high level systems are not suitable for running in parallel, then you should turn to the low level. Many systems consist of running a similar piece of code over a large number of objects. If the order of processing the objects is not important, then you can start a very short-lived second thread to process half of them.

You can visualize this as your main thread “forking” into two (or more) threads, each thread processing some of the data, and skipping over unprocessed data, until all the data is processed. The forked thread(s) then terminate, leaving the main thread to continue along the normal path of execution.


Figure 3 shows how the flow of execution works. The main thread in Core 1 is identical to our simplistic single threaded “solution” in Figure 1. The only difference is that when the main thread is able to process a number of objects concurrently, it forks to process those objects, then continues as a single thread.

Maintaining the original order of execution is a great advantage in simplifying the interactions between systems in your engine. This kind of execution also makes better use of shared caches, since the two threads are running the same code and operate on nearby data.


Figure 4 shows how the threads might be forked down at the object level. To reduce overhead, the forks normally process batches of objects. Here we are processing ten objects at a time. First we see a brief setup phase for one set of objects of type A. Then both cores simultaneously begin processing objects, with core 1 processing objects A0-A9 and core 2 processing A10-A19.

In this example, the processing of objects takes different amounts of time. Here core 2 finishes its batch first, and immediately starts on objects A20-A29. Shortly after that, core 1 quickly starts and finishes off A30-A39.

You can clearly see here that the order of processing the objects cannot be guaranteed. It will probably not be the same from one frame to the next, and will very likely differ from one supposedly identical run to the next (say with input recorded for a demo). This can lead to obscure bugs when objects are mostly processed in one order, but very occasionally are processed in a slightly different order, revealing a bug that is very difficult to track down.

If you can work with that, then this is a very powerful technique. It is especially nice in that it scales very well to larger numbers of CPU cores. If you want to target the Xbox 360, then you can simply fork six times, for the six hardware threads on the three cores of the 360.

SIDESTEPPING AMDAHL

All the methods so far suffer from Amdahl’s law, which notes that a portion of a program’s execution is unavoidably serial; no matter how many processors you spread your parallelized code over, you are still limited by the time it takes to execute that serial code.
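
Stated as a formula: if a fraction $p$ of the work can be parallelized across $n$ processors, the best possible overall speedup is

$$\text{speedup} = \frac{1}{(1 - p) + p/n}$$

so as $n$ grows, the achievable speedup approaches $1/(1-p)$, set entirely by the serial fraction.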

It is possible to sidestep Amdahl’s law to some degree by splitting your code into multiple parts, and running them in parallel in what is essentially a multi-stage pipeline.


Figure 5 shows how this looks during a single frame of execution. Both cores are 100% utilized, with core 1 running logic and rendering, while core 2 is fully devoted to physics.

Obviously you are going to run into problems if you try to run logic and rendering on an object in the middle of its physics update. The solution is to double buffer the output of the physics, and have the logic and rendering work with the results of the physics simulation from the previous frame.
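
A minimal sketch of the double buffering (names are illustrative): the physics thread writes frame N’s results into one buffer while logic and rendering read frame N-1’s results from the other, and the buffers swap at the frame boundary.

struct PhysicsState
{
	// positions, orientations, velocities for all objects ...
};

PhysicsState g_physics_state[2];
int g_write_index = 0; // the buffer physics is writing this frame

// Called once per frame, after both threads have finished their work.
void SwapPhysicsBuffers() { g_write_index = 1 - g_write_index; }

PhysicsState& PhysicsWriteState() { return g_physics_state[g_write_index]; }     // physics thread
PhysicsState& PhysicsReadState()  { return g_physics_state[1 - g_write_index]; } // logic/render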


Figure 6 shows how this works over the course of three frames. It is easier to understand if you start on the second frame. Here the logic and rendering is running on core 1 using the physics data that was calculated in the first frame. At the same time, core 2 is running the physics simulation that will be used by the logic in frame 3.

This architecture should be very familiar to anyone familiar with the way rendering works on a console such as the PS2. There the GPU is effectively another processor, and (usually) runs in parallel with the CPU, rendering the graphics that were prepared by the CPU on the previous frame. Here we are just extending the pipeline.

In theory you could separate this pipeline out even more if you have additional CPU cores to play with. You could pipeline logic and rendering on two separate threads, so the rendering for frame 1 happens at the same time as the logic for frame 2 and the physics for frame 3.

But there are a number of problems with extending it in this way. Firstly, it’s unlikely that the various systems take the same amount of time to execute, so you will be wasting some CPU power. I’ve artificially shown the physics taking the same time as logic and rendering combined, but the idea here is that physics usage by the game designers will expand to fill the physics processing capability available.

A potentially more serious problem is the introduction of lag. With a multi-stage pipeline, it can take several frames for the user’s input to reach the screen. Even in a single threaded game, it takes on average 2.5x the duration of a frame for the player’s input to affect the image seen on screen.

That’s usually acceptable, but if you add another stage to the pipeline, it jumps to 3.5x the duration of a frame. If you are running at 30fps (shame on you!), it could then take a maximum of 4 x 1/30 seconds, or 133 milliseconds; that kind of lag is barely acceptable over the internet. Running at 60fps cuts the lag down to 66ms.

PPU AND GPU

A PPU is a physics processing unit: basically a very fast second processor dedicated to the types of computations frequently used in physics simulations. A PPU could take the place of core 2 in figures 5 and 6, and should be able to handle vastly more objects than a general-purpose CPU core.

A GPU is a graphics processing unit, but modern GPUs have become so incredibly powerful that it is possible to do some physics processing on them. The problem with GPUs is that they are designed to run as a one-way pipe at the end of the pipeline; it’s not really efficient to take the results of calculations from the GPU and feed them back into the start of the pipe.

However, what you can do is separate out the part of the physics that is primarily for cosmetic effects and need not affect the rest of the world. Things such as particle systems, explosions and debris still need physical simulation to move convincingly in the world. The GPU can take over the minimal movement and collision calculations required by these “physics effects,” and incorporate them at the rendering stage. Since this all takes place on the GPU, it greatly reduces the bandwidth required to simulate huge numbers of physics effects.

The best of both worlds would be a graphics card that contains both a PPU and a GPU.

WHAT TO DO?

A hybrid approach may work very well. There are more complex ways of doing this, and there are many other considerations, but combining the physics pipelining of figure 5 with the task forking of figure 3 will allow you to tune your approach to various target architectures in a reasonably straightforward manner. Pipelining your physics in a separate thread will also allow your engine to eventually take advantage of a PPU processor.

RESOURCES

David Wu, Physics in Parallel: Simulation on 7th Generation Hardware.
https://www.cmpevents.com/Sessions/GD/PhysicsinParallel.ppt

Intel Corporation, Threading Methodology: Principles and Practices
http://cache-www.intel.com/cd/00/00/21/93/219349_threadingmethodology.pdf

Henry Gabb and Adam Lake, Threaded 3D Game Engine Basics, Gamasutra,
http://www.gamasutra.com/features/20051117/gabb_01.shtml

January 4, 2007

Mature Optimization

Filed under: Game Development,Inner Product — Mick West @ 6:39 pm

This article originally appeared in the “Inner Product” column in Game Developer Magazine, January 2006

Last month I described an optimization that needs to be done early on in a project, if it is to be done at all. This month I expand on the theme of early optimizations, and present a few examples of what I call “Mature Optimization”.

CONVENTIONAL WISDOM

Every year a fresh crop of young programmers enters the games industry with the wisdom of books and lectures swirling around their heads. One pearl of wisdom is drilled deep into their tender programming muscles; the pearl sometimes known as “Hoare’s Dictum”:

“Premature optimization is the root of all evil” – C.A.R. Hoare.

Unfortunately, in the mind of the freshly minted programmer entering the games industry, this quote seems to turn into:

“Early optimization is evil” – Junior Game Programmer, 2005

The legendary Donald Knuth most famously weighed in on this subject when he wrote these often quoted (and often misleadingly punctuated) sentences:

“We should forget about small efficiencies, say about 97% of the time. Premature optimization is the root of all evil” – Knuth, Literate Programming, 1992, p28.

Nobody wants to argue with Knuth; it is the equivalent of arguing against the second law of thermodynamics. However, the problem is that too often people misunderstand what Knuth was getting at. Ask a game programmer when optimization should be done, and they will tell you: “after the code is written, we can profile it, see where the bottlenecks are, and eliminate them.”

CONVENTIONAL GAME OPTIMIZATION

Often that is exactly what happens. Towards the end of the project we profile the code, and find that 20% of our CPU time is taken up by a function like HandleMessageEvent(), so we track down why that’s taking so much time, then we fix it.

Then we tackle our next function on the profiler’s list; maybe one that takes 10% of the CPU time. We fix that, and then tackle a few more. Our game now runs 30% faster, and our fattest function takes just 1% of our CPU time.

However, if after that the code is still too slow, you will now find it is not because of a few weighty functions, but rather a general problem with all of the code. The code is just generally inefficient. By ignoring optimization until the end of the project you have ended up with “sick code” that contains multiple inefficiencies too deeply ingrained within the fabric of the code to be removed in a timely manner.

WHAT KNUTH REALLY SAID

When Knuth said, “we should forget about small efficiencies, about 97% of the time”, he was actually defending something he just did on the previous page, which was to optimize a loop to make it 12% faster, using “goto”.

Knuth then states my main point here, something contrary to what most programmers initially seem to take from Hoare’s Dictum:

“The conventional wisdom shared by many of today’s software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by penny-wise-and-pound-foolish programmers, who can’t debug or maintain their ‘optimized’ programs” (Knuth, 1992, p28)

There you have it. Conventional wisdom was wrong in 1992. Sadly, conventional wisdom continues to be wrong 14 years later. Programmers, even game programmers, are still starting out with the misconception that optimization is evil, and so they ignore optimization until the end of the project.

MATURE OPTIMIZATION

What is evil here is a very specific type of optimization: “premature optimization,” meaning optimization done early in development without a real understanding of whether it is necessary, and which may have an overall negative effect on the project.

Given that we have an evil type of optimization called “premature optimization,” it then follows that there is a contrasting type of optimization that is not evil, which we naturally call “mature optimization.”

Mature optimization is any optimization done early in development that you know in advance will provide a significant performance boost without unmanageable side effects. Mature optimizations are often well-known techniques that were successfully used before. Mature optimizations include small local modifications, coding standards, and architecture-level design decisions.

GAME OPTIMIZATION

Game programming is complicated, and gets more complicated every year; I think it’s now more complicated than either Hoare or Knuth could really have appreciated. A game engine comprises a large number of systems, each handling a part of the game, and each system is itself often vastly more complex than an entire game of 20 years ago. Since each system is so complex, and each contributes only a little to CPU usage, individual optimizations are not often going to have much effect on overall performance. To ensure our game runs as fast as possible, we need to optimize our code as it is developed, using every mature optimization available to us.

Developing a game is more and more about developing the content for that game (rather than developing the code). When developing that content, you cannot really get a good idea of how the game plays unless the code is running at the speed you hope it will run at when you ship the game. This is especially true of action and sports games, where the “feel” of the game is a vital part of the gameplay, and very important in establishing a difficulty curve. Thus, it is beneficial to do as much optimization as possible in the early stages of a project.

Most games are real-time applications. A very large chunk of the code is running thirty (or hopefully sixty) times per second. In developing business applications, a response time of a large fraction of a second for a single transaction is quite acceptable. In the games industry we need to advance the state of the entire game in just 0.0166 seconds. It is very rare for a single transaction or operation (say total collision detection) to be budgeted for more than one millisecond (0.001 seconds).

Processing of triangles and other graphics primitives has mostly been off-loaded to the GPU. The CPU is mostly concerned with iterating over game objects, and doing things with them. Optimization of games even a few years ago was often heavily weighted towards optimizing the rendering loop. The balance here has been shifting over the years to the bulk of the CPU time being used by game logic, AI and physics. The result is that there is less “low hanging fruit” to optimize.

The relatively high power of modern CPUs has also contributed to a shift toward physics calculations taking up more and more of the CPU’s power. Here again it is important that optimization be completed early, so that the designers can design their game scenarios within well-defined limits on the complexity of the physics simulations they can use, specifically the number of objects in a scene. If you leave the optimization until too late, then you will either have a scene that runs too slowly (and may require some last minute butchering), or a scene that is much lower in complexity than the engine can support, and which looks weak compared to the competition.

Given that overview of the issues relating to optimizing games, I will now give a few representative examples of mature optimizations.

AVOID ALLOCATIONS

Memory allocations are expensive. If you have a system that continuously makes a large number of small allocations from the main heap, consider making some kind of custom allocator. A good example is a particle system: if you know a particle system has a maximum of 1024 particles, then you should use a high-speed pool rather than the heap.
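
A minimal sketch of such a pool, assuming a simple CParticle struct: allocation and freeing are O(1) free-list operations on a preallocated array, with no heap traffic at all.

struct CParticle { /* position, velocity, color, ... */ };

const int MAX_PARTICLES = 1024;

class CParticlePool
{
public:
	CParticlePool() : m_first_free(0)
	{
		// Chain every slot into the free list up front.
		for (int i = 0; i < MAX_PARTICLES - 1; i++)
			m_next[i] = i + 1;
		m_next[MAX_PARTICLES - 1] = -1; // end of list
	}
	CParticle* Alloc()
	{
		if (m_first_free < 0)
			return 0; // pool exhausted
		int slot = m_first_free;
		m_first_free = m_next[slot];
		return &m_particles[slot];
	}
	void Free(CParticle* p)
	{
		int slot = int(p - m_particles); // recover the slot index
		m_next[slot] = m_first_free;
		m_first_free = slot;
	}
private:
	CParticle m_particles[MAX_PARTICLES];
	int       m_next[MAX_PARTICLES]; // free-list links
	int       m_first_free;
};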

This kind of memory system optimization is well known in game development. However, it does fall into the category of a mature optimization, since it can be tricky and dangerous to replace one method of memory allocation with another toward the end of a project.

AVOID RTTI

RTTI, or Run-Time Type Information, is what allows you to use dynamic_cast on a base class pointer to discover what type of derived class it actually points to. The problem with RTTI is that it is very slow. It depends on the specific compiler implementation, but on many compilers a dynamic_cast is implemented using the strcmp function on the name of the class. If your profiling shows strcmp registering above 0.1% of CPU time, suspect RTTI.

Instead of using dynamic_cast, first identify those classes that require some kind of run-time type inspection, consider whether you really need it, and then incorporate a type member variable into the base class, with an enum providing the type numbers. You then do your own RTTI by checking the type variable to verify the type (if needed), and then use static_cast to do the actual cast.
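
A sketch of that scheme, with illustrative class names:

enum EntityType
{
	ENTITY_GENERIC,
	ENTITY_CAR,
	ENTITY_PEDESTRIAN,
};

class CEntity
{
public:
	explicit CEntity(EntityType type) : m_type(type) {}
	EntityType GetType() const { return m_type; }
private:
	EntityType m_type; // our hand-rolled type information
};

class CCar : public CEntity
{
public:
	CCar() : CEntity(ENTITY_CAR) {}
	void HonkHorn() { /* car-specific behavior */ }
};

void TryHonk(CEntity* entity)
{
	// Check the type enum, then static_cast; no strcmp, no dynamic_cast.
	if (entity->GetType() == ENTITY_CAR)
		static_cast<CCar*>(entity)->HonkHorn();
}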

The reason avoiding RTTI qualifies as a mature optimization is that it can be almost impossible to apply towards the end of the project. If a programmer starts implementing a complex system with many usages of RTTI, then it can be a significant undertaking to remove it. Adding your own optimized version of RTTI is best done early.

AVOID ALIASING PROBLEMS

Aliasing is an insidious problem with compiled code. Aliasing itself is a straightforward concept: the same piece of memory can be pointed at by two different pointers. The performance problem occurs when the compiler assumes that the contents of memory can be changed by something other than the current code, and hence continually re-fetches a value from memory when you, the programmer, know that it is not going to change.

How you handle aliasing problems depends on your compiler. You might simply tell the compiler to assume no aliasing occurs, or you might be able to use the “__restrict__” keyword selectively to tell the compiler that a particular pointer is not aliased. Both of these solutions have the characteristics of a mature optimization, in that they are best done early in the project if they are to have an effect safely. It is also a mature optimization in that it often takes a mature programmer to spot the problem; detecting aliasing as an issue can require looking at the disassembly to see exactly what instructions are being generated.
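
An illustrative example of the restrict fix (the exact keyword spelling varies by compiler: __restrict, __restrict__, or restrict):

// Telling the compiler these pointers never alias lets it keep *scale in a
// register instead of re-fetching it from memory on every iteration.
void ScaleAll(float* __restrict dest, const float* __restrict src,
              const float* __restrict scale, int count)
{
	for (int i = 0; i < count; i++)
		dest[i] = src[i] * *scale;
}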

AVOID PER-FRAME CODE

Not everything needs to be done 60 times a second. A huge saver of CPU time is to make certain aspects of the game logic run at a frame rate slower than the screen refresh rate. Most physics will look quite reasonable running at 30 frames per second. Some types of logic, such as path-finding, can be spread over a large number of frames.

This is a good example of a mature optimization, and many games use it. It can be a massive time saver, cutting as much as 20% to 50% off your CPU time, yet it is very difficult to add late in a project: the change from logic running synchronized with the rendering frame advance to logic running independently can introduce all kinds of obscure timing bugs.
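
As a minimal sketch of one form of this (all names hypothetical), the expensive part of each object’s update can be staggered so it runs every other frame:

struct GameObject
{
	void UpdateEveryFrame() { /* cheap work: movement, animation, ... */ }
	void UpdateExpensive()  { /* heavy work: path-finding, line of sight, ... */ }
};

GameObject g_objects[1000];
const int  g_num_objects = 1000;

void UpdateObjects(int frame_number)
{
	for (int i = 0; i < g_num_objects; i++)
	{
		g_objects[i].UpdateEveryFrame();
		// Odd-indexed objects update on odd frames, even-indexed on even
		// frames, so each object's heavy logic runs at half the frame rate.
		if ((i & 1) == (frame_number & 1))
			g_objects[i].UpdateExpensive();
	}
}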

AVOID ITERATIONS

Doing any single thing on a modern CPU takes no time at all. Where we fall over is when we need to do that single thing multiple times. Whenever you are iterating over some list of things in your game, consider how you might be iterating over fewer things, or not iterating at all.

For example, let us say we want to automatically target enemies that are within range. The simple way to do this is to iterate over all objects in the game, see which can be targeted, and of those see which are within range.

The problem here is that the list of all objects might contain only a few that are actually targetable. You can optimize this by maintaining a separate list of all the objects that can be targeted. When targetable objects are created and destroyed, they are added to and removed from this list. Then, when you want to target something, you just iterate over the shorter list.
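
A sketch of maintaining that shorter list (CObject is a stand-in for your game object class):

#include <vector>

class CObject;

std::vector<CObject*> g_targetable; // only objects that can be targeted

void RegisterTargetable(CObject* obj)
{
	g_targetable.push_back(obj); // call on creation
}

void UnregisterTargetable(CObject* obj)
{
	// Call on destruction: swap-and-pop removal, since order does not matter.
	for (size_t i = 0; i < g_targetable.size(); i++)
	{
		if (g_targetable[i] == obj)
		{
			g_targetable[i] = g_targetable.back();
			g_targetable.pop_back();
			break;
		}
	}
}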

PROFILE INLINE FUNCTIONS

When profiling your code at a function level, most profilers will not be able to give you any kind of reading for inline functions. This is because the optimizing compiler interleaves the inline function’s instructions with the instructions of the calling function, making it impossible to tell when the inline function is being executed.

While this is expected behavior, it can often hide problems with weighty inline functions that are either too slow, or called too often. The culprit will often show up as some large higher level function, which calls several inline functions.

To get a clearer picture of exactly what is going on, you can turn off the expansion of inline functions just for the compilation unit that contains the high level function. This should reveal the inline functions to the profiler. Recognize that this is not an entirely accurate picture of what is going on, but it will give you an indication of the balance of CPU usage between the high level function and the inline functions, and hence indicate where you need to focus your optimization efforts.

PROCESS OFFLINE

Load times for games need to be short. Yet too often we get to the end of a project and find the load times are over a minute. Some advance optimization here can save you a lot of trouble later.

The most obvious optimization is to load your assets as one big file (a ZIP, PAK or WAD file), rather than as a large number of individual files. This is a very mature optimization, one all developers should be using, especially when developing for consoles, and yet it is surprising just how often the problem comes up. Developers often start off by loading individual files because, at the start of development, there are simply not many files to load, and the load times are bearable.

Towards the end of a project, there is a rapid increase in the number of assets in the game, and load times soar. Patching in a quick load system at this point can be very difficult, especially with the limited memory available on a console to act as a buffer. You will be forced to do some kind of optimization at the end of the project. Do it early. Do it maturely.

FURTHER READING

– Hyde, Randall, “Embedded-system programmers must learn the fundamentals”, EDN, 5/26/2005 http://www.edn.com/article/CA601846.html
– Kirmse, Andrew, “Optimization for C++ Games”, Game Programming Gems 3, pp 5-15
– Grinke, Sebastian, “Minimizing Agent Processing in Conflict: Desert Storm”. AI Game Programming Wisdom 2, pp 373-378

Follow Up:


Notes on “Mature Optimization”

More Follow Up:

Turns out Knuth actually wrote that back in 1974, in this paper:
Structured Programming with go to Statements (rehosted), Computing Surveys, Vol 6, No. 4, Dec 1974, page 268 (page 8 of the pdf).

It’s amusing how many people who quote Knuth on premature optimization are also firm believers in “goto considered harmful,” not realizing that the paper they are quoting is basically a paean both to goto and to mature optimization.

Update: Similar article by Randall Hyde
http://www.acm.org/ubiquity/views/v7i24_fallacy.html

Practical Hash IDs

Filed under: Game Development,Inner Product — Mick West @ 6:28 pm

This article originally appeared in the “Inner Product” column in Game Developer Magazine, December 2005

Game assets frequently need to be referenced indirectly by the engine, by code and by scripts. The simplest way to do this is to use the name of the object; however, this is inefficient and otherwise problematic. This article explains the practice of using a 32-bit hash of the name as a unique ID, and explores some ways of integrating this into your code and tool pipeline.

PREMATURE OPTIMIZATION?

This article is about an optimization. Optimizations are typically done at the end of a project. Donald Knuth famously quotes Tony Hoare as saying “premature optimization is the root of all evil,” but he prefaces this with an important caveat: “about 97% of the time.”

This optimization is one of those 3% (or more, opinions vary) best done early. It’s such a fundamental change that it affects many aspects of the game engine code and data structures. It’s not an optimization you do at the end to speed things up, but rather one you do at the start, to enable you to do more right from the early stages of development.

THE NEED FOR ASSET IDS

A game contains a very large number of assets. Meshes, textures, animations, skeletons, scripts, sounds, effects, triggers, events, NPCs, and various other miscellaneous resources, objects and entities.

These assets all need some way of being referenced. Either as part of some internal linkage (a mesh will reference the textures it needs), or as some intentional reference on the part of the designer (“play the Man_jump_01a animation”, “blow up building City04_Bank01”).

These references need to be resolved at run-time, as this gives you a much more dynamic development environment than statically linking everything together at compile time. Assets can be changed independently and re-loaded. Small sections of the game data can be recompiled quickly without having to recompile everything. (See Sean Barrett’s “Late Binding Data”, Game Developer, Feb 2005).

The assets will nearly always have a name. Textures have the file name of the original .PSD, .PNG or .TIFF. Scenes exported from 3D Studio Max or a similar package will have a name associated with each object in the scene.

So the simplest way of referencing an asset is to just use the string containing the name. This has a great advantage in that it’s very straightforward. However there are several disadvantages.

PROBLEMS WITH STRINGS

Firstly, strings take up a surprising amount of memory. Asset names are often quite long, especially if paths are included. So if every mesh has to reference every texture it uses by name, that can add up to an unnecessarily large amount of memory.

Secondly, it’s slow. Comparing two strings takes a lot longer than comparing two numbers. Asset names often share long common prefixes, so a lookup in a table scans many matching characters before each comparison fails. Plus the length of the string means you need to perform a large number of memory accesses, which can be very slow on architectures with expensive data reads, such as the PS2.

Thirdly, it does not fit into data structures nicely. The strings are of variable length, making it difficult to form neat data structures around them.

Fourthly, having a large number of strings visible in the data might be something of a security concern. Plot spoilers might be deduced from the names of assets (“Hey, what’s this ‘MDL_SPIDERWOMAN’? Sweet!”). And if you want to stop people hacking left-over content in your game (like the infamous “hot coffee” mod in GTA-SA), then you’re better off not giving them a load of clues.

USE THE HASH VALUE

So, instead of using the string itself, one approach is to use a hash value of the string. As you probably know, a hash value is a number calculated by mangling the bits of the string together using some algorithm to produce a fixed number of bits. There are various hash algorithms that produce various sized outputs. Algorithms such as SHA, MD5 or RIPEMD-160 produce hash values of 128 or 160 bits in length (16-20 bytes) and so don’t really help us very much.

The string hash algorithm that best fits our needs here is CRC-32. CRC-32 takes a string and gives you a 32-bit value. The algorithm is specifically designed so that small differences in a string produce very large differences in the output. This is very important, since you don’t want strings such as “ANIM_10002” and “ANIM_20001” to result in the same hash value (as they would if you did a traditional checksum where you added or XORed the bytes or words together).
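For reference, here is a minimal table-driven CRC-32 using the standard reflected polynomial 0xEDB88320 (the one used by ZIP and zlib); a generic sketch rather than any particular engine’s routine:

#include <stdint.h>

static uint32_t s_crc_table[256];

// Build the 256-entry lookup table once at startup
void InitCrcTable()
{
    for (uint32_t i = 0; i < 256; i++)
    {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        s_crc_table[i] = c;
    }
}

uint32_t crc32(const char *s)
{
    uint32_t crc = 0xFFFFFFFFu;             // conventional initial value
    for (; *s; s++)
        crc = s_crc_table[(crc ^ (uint8_t)*s) & 0xFF] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;               // conventional final XOR
}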

Using the hash in place of the string immediately removes the problems I mentioned above. Comparisons are now very fast, since you only need to compare two 32 bit values. Memory accesses are reduced to a bare minimum. Data structures can now have a nice neat single 32-bit word to store the reference rather than a variable length, memory hogging string.

Using a hash has other benefits. Since it’s a hash value, you can use it as a very fast index into a hash table simply by lopping off however many bits you need: for a 4,096-entry table, just use the bottom 12 bits. Or even easier, you can pick 12 bits out of the middle of the word to match the power-of-two alignment of your table (so if your hash table has 8 bytes per entry, just mask off bits 14 through 3, and you’ve got the byte offset into the table).
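For instance (a sketch; Entry and the table pointer are hypothetical names, assuming 8-byte entries):

#include <stdint.h>

struct Entry { uint32_t key; uint32_t value; };   // 8 bytes per entry

// Bits 14..3 of the hash form a 12-bit index that is already multiplied
// by the 8-byte entry size, so the masked value is directly a byte offset.
Entry *HashTableSlot(Entry *table, uint32_t hash)
{
    uint32_t byte_offset = hash & 0x7FF8;         // keep bits 14 through 3
    return (Entry *)((char *)table + byte_offset);
}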

Another useful technique applies when game assets need to hold pointers to each other. By using a 32-bit hash you can write out a data structure identical to that used in the game, but with hash values in the memory locations usually occupied by the pointers. Then, when the asset is loaded and bound into the game, you simply replace the hashes with the appropriate pointers. This removes the need to build data structures from parsed data, and can greatly speed up the loading process, while still giving you the flexibility of late-binding data.
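A sketch of the bind step (all names here are hypothetical; this assumes 32-bit pointers, as on the consoles of the era, so a hash and a pointer occupy the same four bytes):

#include <stdint.h>

// Hypothetical: find a loaded asset's address from its name hash
void *LookupAssetByHash(uint32_t hash);

// Each fixup records where in the loaded block a hash value sits in a
// slot that will hold a pointer once the asset is bound.
struct Fixup { uint32_t offset_in_block; };

void BindAssetPointers(char *block, const Fixup *fixups, int fixup_count)
{
    for (int i = 0; i < fixup_count; i++)
    {
        uint32_t *slot = (uint32_t *)(block + fixups[i].offset_in_block);
        *slot = (uint32_t)(uintptr_t)LookupAssetByHash(*slot); // 32-bit pointers assumed
    }
}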

FEAR OF COLLISIONS

So why don’t people use hashes for IDs? Well, the first thing people think when presented with this idea is “what if there’s a collision?”

A collision is when two strings have the same hash value. Since a 32-bit hash has only about 4 billion possible values, if you throw in enough strings you will eventually find two that share one.

The probability that two particular strings have the same hash value is 1 in 4,294,967,296 (2^32). With a larger number of strings, the probability of some collision rises. By the birthday approximation, p ≈ 1 - e^(-n^2/2^33), so at about 65,000 strings the chance that at least one pair of distinct strings shares a CRC-32 value is already around 40%, and it passes even odds at roughly 77,000 strings.

For a practical example, I did a check for collisions on 89,000 unique strings culled from some random dump of a dictionary, and there was only one collision: the strings “bellys” and “cochleariform” both have the same CRC32 (0x5D1F90F6). Later, when checking a list of 216,000 strings, I got only seven collisions.

Remember there is a big difference between finding a collision somewhere in a large set of strings, and the chance of one new string causing a collision. If you’ve already got 89,000 strings with no collisions, and you add one more, the chance that it collides with an existing string is 89,000/2^32, or about 1 in 48,000.

So depending on how many IDs you actually use, you will have to deal with collisions only very rarely.

DETECTING AND HANDLING COLLISIONS

But collisions will occur, and at some point you will need to detect and deal with them.

To detect collisions, you need to keep an automated central database of all the strings and the hashes that are used. This database is used solely during development, so it won’t require any overhead in the actual game. It need not be complex: a simple list will usually suffice, and you don’t need to worry about removing unused entries; just purge the database whenever you do a full rebuild.
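As a minimal sketch (names hypothetical), the database can be as simple as a map from hash to string that the asset pipeline feeds every new string through:

#include <stdint.h>
#include <map>
#include <string>
#include <cstdio>

static std::map<uint32_t, std::string> s_hash_registry;

// Returns false (and complains) if this hash is already claimed by a
// different string: a genuine collision that must be resolved.
bool RegisterHash(uint32_t hash, const char *name)
{
    std::map<uint32_t, std::string>::iterator it = s_hash_registry.find(hash);
    if (it != s_hash_registry.end() && it->second != name)
    {
        printf("HASH COLLISION: \"%s\" and \"%s\" both give 0x%08X\n",
               name, it->second.c_str(), (unsigned)hash);
        return false;
    }
    s_hash_registry[hash] = name;
    return true;
}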

When you find a collision, the simplest way to deal with it is to change the string. Now I know that this sounds like a rather hacky solution. But it’s really not going to happen very often. You’ll perhaps have to change a string once a month towards the end of the project, and not at all during the first half of the project, when the number of unique strings is still very low.

USING HASHES IN CODE

The other reason that people don’t use hashes is that they can’t read them. A simple implementation of this system would involve someone running a CRC32 program on the string and then manually typing the checksum into the code where the asset is referenced. Instead of typing LoadAsset("MDL_ENEMY_GUNSHIP"), you type LoadAsset(0x23C55C3A). Nasty.

Ideally the user (in this case the programmer or designer) should never have to see these hexadecimal strings. The entire process should be automated to make it both easier to read and less prone to errors. (I’ve cut and pasted the wrong hash value a few times in the past.)

When dealing with compiled data such as meshes, animations, textures etc., this is relatively straightforward. You just compile the strings into hashes when you compile the data into whatever engine specific format you need.

When you need to reference an asset from C/C++ code, the story is somewhat different. Here are four ways of doing this, in the order I first considered using them.

1) Just use the hash values directly – Not a very good option. You’ll type them in wrong, and there’s no easy way to tell whether a given value is the right hash. You can improve things slightly by adding a comment next to the hash value, but it’s still a messy and fragile method.

2) Use #defines – In this system you add all your strings to a header file (say crc_hashes.h), with a line for each possible string. (listing 1).

Listing 1 – a header file containing all the hash values

// crc_hashes.h
#ifndef __CRC_HASHES_H__
#define __CRC_HASHES_H__

#define CRC_MDL_MECH01_LARGE    0x4854987A
#define CRC_MDL_MECH02_LARGE    0x374FAAD2
#define CRC_ANM_FOX_JUMP        0xF003C8A8
// add additional crcs here ...
#endif

Then in the code you just use the CRC_ symbols. This initially seems like a reasonable option, but suffers from the huge downside that you have to include crc_hashes.h in every single file that uses a hash value. This effectively means that whenever you add a new hash, you need to re-compile a significant portion of the game.

3) Use a macro that checks the hash value, but is compiled out in “ship” mode (listing 2). With this you enter both the string and the hash value directly into the code (see the example in listing 3).

Listing 2 – a macro for checking your hashes at runtime in debug mode.


inline uint32 check_hash(uint32 id, const char *name)
{
    // verify the hand-typed hash actually matches its string
    assert(id == crc32(name));
    return id;
}

#ifdef DEBUG
#define HASHCHECK(id, name) check_hash(id, name)
#else
#define HASHCHECK(id, name) id   // ship mode: compiles away to the bare value
#endif

Listing 3 – Showing the different ways of using hash values in code.

// hash value directly:
    Load(0x334BDCF8);
// hash value with inline comment:
    Load(0x334BDCF8 /*MDL_MECH01_LARGE*/);
// using a #define from a common header file:
    #include "crc_hashes.h"
    ...
    Load(CRC_MDL_MECH01_LARGE);
// using a macro to check values:
    Load(HASHCHECK(0x334BDCF8, "MDL_MECH01_LARGE"));
// using additional preprocessing:
    Load(HASH("MDL_MECH01_LARGE"));
// the same line after the custom build step:
    Load(0x334BDCF8 /*MDL_MECH01_LARGE*/);

This ensures your hash values match your strings, and in ship mode the macro compiles away to leave just the hash value itself. This works well enough, and it’s relatively straightforward to set up a hot key in your editor that generates the hash and the HASHCHECK() call and inserts them into the code, so you never have to deal with the hash value directly.

However there are still problems. Firstly, you’ve still got the hash value there in the code, leaving you open to cut-and-paste problems. Secondly, the check happens at run-time, so if it’s in an obscure bit of code (maybe it only runs at the end of the game?), the check might not fire until weeks after you add it. Thirdly, it’s slow in debug mode, since you have to generate a checksum for each string.

As I wrote in another article (Game Developer, Dec 2005), I don’t think there should be separate DEBUG and RELEASE modes, but rather a single DEVELOP mode used for both purposes. Having the code check a hash value against a string is slow, and it’s done far too many times: the hash might be used several times per frame, but if it’s right once, it’s always going to be right. You can hack the macro so it only checks once, but you’ve either got to add another check or write some self-modifying code.
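For what it’s worth, here is a sketch of a check-once variant of the macro, using a local static inside a lambda; note that lambdas are C++11 and so post-date this article. Each macro expansion creates a distinct closure type, and hence a distinct flag:

// Verifies the hash against the string the first time each call site
// executes, then never again (one static flag per expansion site).
#define HASHCHECK_ONCE(id, name) \
    ([](uint32 h, const char *n) -> uint32 { \
        static bool s_checked = false; \
        if (!s_checked) { assert(h == crc32(n)); s_checked = true; } \
        return h; \
    }(id, name))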

4) Use a custom preprocessor. With this approach everything is automated: you simply use the string itself, tagging it by passing it as the parameter to a pseudo-function called HASH(). You then introduce a custom build step that scans your source for calls to HASH() and replaces them with the correct hash value (see listing 3 for a comparison of all the methods).

The great benefit of this system is that the programmer never has to see (or type) another hex string in order to get something in the game. Once the system is set up, the programmer can continue to refer to assets by name, and the optimization of using hash values will take place under the hood. The disadvantage is that you have to add a custom build step that generates intermediate files to be compiled. If you already have such a step it’s probably not such a big deal to extend it to include this. But changing your build process is not a trivial task.

Introducing a custom build step runs the risk of making your code less portable; a new platform might not be as flexible in its compiler options. To get around this, you can define HASH() as a real function that calculates and returns the checksum of the string passed to it. That way, even if the additional compilation step is missing, the code will still work perfectly (albeit a little slower).
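A sketch of the fallback (assuming a crc32() routine like the earlier sketch):

// If the custom build step didn't run, HASH() is an ordinary function
// that computes the checksum at runtime: slower, but still correct.
inline uint32_t HASH(const char *str)
{
    return crc32(str);
}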

One additional place where these string ID hashes need special consideration is in switch statements. The value after a case keyword must be a compile-time constant, not the return value of a function, so the HASHCHECK() method will not work. A custom preprocessing step works fine, since the constant value is inserted automatically and no macro is involved. However, the fallback of also defining HASH() as a function will not work in switch statements.
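For example (asset_id is hypothetical; the hash value is borrowed from listing 3), a switch only compiles once the build step has substituted the constant:

switch (asset_id)
{
    case 0x334BDCF8: /*MDL_MECH01_LARGE*/   // legal: compile-time constant
        break;
    // case HASH("MDL_MECH02_LARGE"):       // only legal if the build step
    //     break;                           // replaced HASH() with a constant
}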

OTHER CONSIDERATIONS

By default, CRC-32 is case sensitive. For this use, though, it is probably a good idea to make it case insensitive, especially if you are going to use hashes in some kind of script environment used by designers. Since you never really want MDL_BigJump01 to mean something different from MDL_Bigjump01, it makes everyone’s life easier. Making your CRC routine case insensitive need not make it any slower if you use a table-based method (just duplicate the upper case entries with the lower case entries).
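One way to realize this (my reading of the table trick, reusing s_crc_table from the earlier sketch): run each byte through a 256-entry case-folding table before the usual CRC table lookup, which costs one extra indexed read per byte:

#include <stdint.h>

static uint8_t s_fold[256];   // case-folding table: s_fold['A'] == 'a', etc.

// Build the fold table once at startup
void InitFoldTable()
{
    for (int c = 0; c < 256; c++)
        s_fold[c] = (uint8_t)((c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c);
}

uint32_t crc32_nocase(const char *s)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (; *s; s++)
        crc = s_crc_table[(crc ^ s_fold[(uint8_t)*s]) & 0xFF] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;
}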

If you are using hashes to uniquely identify objects in the game, you might want to reserve some for specific system objects. You could do this by adding a few names like SYSTEM_0001, SYSTEM_0002, etc. Or you can simply reserve the first (say) 1,024 values (0x00000000 through 0x000003FF) and explicitly report them as illegal values in your database. This might seem a little odd, disallowing a whole range of hash values, but only one out of every 4,194,304 possible strings will have a hash value in this range.

Given the hash value of a string A (HASH(A)), and another string B, you can calculate HASH(A+B) by hashing B with HASH(A) as the starting value. This means you can calculate HASH(A+B) without ever knowing the actual contents of string A. This is very useful if you have assets with a series of different extensions or other appended strings (e.g. MDL_ROBOT_01_ might have MDL_ROBOT_01_LEFT_ARM, MDL_ROBOT_01_LEFT_LEG, etc.). You can quickly calculate the hash values of the extended strings without having to know the original string.
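A sketch of the continuation, again reusing s_crc_table from the earlier sketch; note that the conventional final XOR has to be undone before resuming:

// crc32_append(crc32("MDL_ROBOT_01_"), "LEFT_ARM") gives the same value
// as crc32("MDL_ROBOT_01_LEFT_ARM"), without needing the prefix string.
uint32_t crc32_append(uint32_t prefix_crc, const char *suffix)
{
    uint32_t crc = prefix_crc ^ 0xFFFFFFFFu;  // undo the final XOR
    for (; *suffix; suffix++)
        crc = s_crc_table[(crc ^ (uint8_t)*suffix) & 0xFF] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;                 // re-apply the final XOR
}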

RESOURCES

A description of the CRC-32 algorithm.
http://en.wikipedia.org/wiki/CRC32

A practical, public domain implementation of CRC32
http://www.csbruce.com/~csbruce/software/crc32.c
