Mature Optimization « Cowboy Programming

January 4, 2007

Mature Optimization

Filed under: Game Development,Inner Product — Mick West @ 6:39 pm

This article originally appeared in the “Inner Product” column in Game Developer Magazine, January 2006

Last month I described an optimization that needs to be done early on in a project, if it is to be done at all. This month I expand on the theme of early optimizations, and present a few examples of what I call “Mature Optimization” .

CONVENTIONAL WISDOM

Every year a fresh crop of young programmers enters the games industry with the wisdom of books and lectures swirling around their heads. One pearl of wisdom is drilled deep into their tender programming muscles; the pearl sometimes known as “Hoare’s Dictum” :

“Premature optimization is the root of all evil” – C.A.R. Hoare.

Unfortunately, in the mind of the freshly minted programmer entering the games industry, this quote seems to turn into:

“Early optimization is evil” – Junior Game Programmer, 2005

The legendary Donald Knuth most famously weighed in on this subject when he wrote these often quoted (and often misleadingly punctuated) sentences:

“We should forget about small efficiencies, say about 97% of the time. Premature optimization is the root of all evil” – Knuth, Literate Programming, 1992, p28.

Nobody wants to argue with Knuth – it is the equivalent of arguing against the second law of thermodynamics. However, the problem here is that too often people misunderstand what Knuth was getting at. Ask a game programmer when optimization should be done, and they will tell you “after the code is written, we can profile it, see where the bottlenecks are, and eliminate them” .

CONVENTIONAL GAME OPTIMIZATION

Often that is exactly what happens. Towards the end of the project we profile the code, and find that 20% of our CPU time is taken up by a function like HandleMessageEvent(), so we track down why that’s taking so much time, then we fix it.

Then we tackle our next function on the profiler’s list; maybe one that takes 10% of the CPU time. We fix that, and then tackle a few more. Our game now runs 30% faster, and our fattest function takes just 1% of our CPU time.

However, if after that the code is still too slow, you will now find it is not because of a few weighty functions, but rather a general problem with all of the code. The code is just generally inefficient. By ignoring optimization until the end of the project you have ended up with “sick code” that contains multiple inefficiencies too deeply ingrained within the fabric of the code to be removed in a timely manner.

WHAT KNUTH REALLY SAID

When Knuth said, “we should forget about small efficiencies, about 97% of the time” , he was actually defending something he just did on the previous page, which was to optimize a loop to make it 12% faster, using “goto” .

Knuth then states my main point here – something contrary to what most programmers initially seem to take from Hoare’s Dictum

“The conventional wisdom shared by many of today’s software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by penny-wise-and-pound-foolish programmers, who can’t debug or maintain their ”˜optimized’ programs” (Knuth, 1992, p28)

There you have it. Conventional wisdom was wrong in 1992. Sadly, conventional wisdom continues to be wrong 14 years later. Programmers, even game programmers, are still starting out with the misconception that optimization is evil, and so they ignore optimization until the end of the project.

MATURE OPTIMIZATION

What is evil here is a very specific type of optimization: “Premature optimization” which is optimization done early in development without a real understanding of if it is necessary, and which may have an overall negative effect on the project.

Given we have an evil type of optimization called “premature optimization” , It then follows that there is a contrasting type of optimization that is not evil, which we naturally call “Mature Optimization” .

Mature optimization is any optimization done early in development that you know in advance will provide a significant performance boost without unmanageable side effects. Mature optimizations are often well-known techniques that were successfully used before. Mature optimizations include small local modifications, coding standards, and architecture-level design decisions.

GAME OPTIMIZATION

Game programming is complicated, and getting more complicated every year. I think it’s now more complicated than either Hoare or Knuth really could appreciate. A game engine is comprised of a large number of systems, each handling a part of the game, and each system is itself often vastly more complex than an entire game of 20 years ago. Since each system is so complex, and each system only contributes a little to CPU usage, then individual optimizations are not often going to have much effect on overall performance. To ensure our game runs as fast as possible, we need to optimize our code as it is developed, using every mature optimization available to us.

Developing a game is more and more about developing the content for that game (rather than developing the code). When developing that content, you cannot really get a good idea of how the game plays unless the code is running at the speed you hope it will run at when you ship the game. This is especially true of action and sports games, where the “feel” of the game is a vital part of the gameplay, and very important in establishing a difficulty curve. Thus, it is beneficial to do as much optimization as possible in the early stages of a project.

Most games are real-time applications. A very large chunk of the code is running thirty (or hopefully sixty) times per second. In developing business applications, a response time of a large fraction of a second for a single transaction is quite acceptable. In the games industry we need to advance the state of the entire game in just 0.0166 seconds. It is very rare for a single transaction or operation (say total collision detection) to be budgeted for more than one millisecond (0.001 seconds).

Processing of triangles and other graphics primitives has mostly been off-loaded to the GPU. The CPU is mostly concerned with iterating over game objects, and doing things with them. Optimization of games even a few years ago was often heavily weighted towards optimizing the rendering loop. The balance here has been shifting over the years to the bulk of the CPU time being used by game logic, AI and physics. The result is that there is less “low hanging fruit” to optimize.

The relatively high power of modern CPUs has also contributed to a shift toward physic calculations taking up more and more of the CPU’s power. Here again it is important that optimization be completed early, so that the designers can design their game scenarios within well-defined limits as to the complexity of the physics simulations they can use, specifically the number of objects in a scene. If you leave the optimization until too late, then you are either going to have a scene that runs too slow (and may require some last minute butchering), or a scene that is much lower in complexity than the engine can support, and which looks weak compared to the competition.

Given that overview of the issues relating to optimizing game, I will now give a few representative examples of mature optimizations.

AVOID ALLOCATIONS

Memory allocations are expensive. If you have some system that is doing a large number of small allocations from the main heap, on a continuous basis, then consider making some kind of custom allocator. A good example of this is a particle system. If you know a particle system has a maximum of 1024 particles, then you should use a high-speed pool rather than a heap.

This kind of memory system optimization is well known in game development. However, it does fall into the category of a mature optimization, since it can be tricky and dangerous to replace one method of memory allocation with another toward the end of a project.

AVOID RTTI

RTTI, or Run Time Type Inspection, is what allows you to use <dynamic_cast> on a base class pointer to see what type of derived class it actually is. The problem with RTTI is that it is very slow. It depends on the specific compiler implementation, but on many compilers a <dynamic_cast> is implemented using the strcmp function on the name of the class. If your profiling shows that strcmp is registering above 0.1% of CPU time, then suspect RTTI.

Instead of using <dynamic_cast> you should first identify those classes that require some kind of run-time type inspection, consider if you really need it, and then incorporate a type member variable into the base class, with an enum providing the type numbers. You then do your own RTTI by checking the type variable to verify the type (if needed), and then use <static_cast> to do the actual cast.

The reason avoiding RTTI qualifies as a mature optimization is that it can be almost impossible to apply towards the end of the project. If a programmer starts implementing a complex system with many usages of RTTI, then it can be a significant undertaking to remove it. Adding your own optimized version of RTTI is best done early.

AVOID ALIASING PROBLEMS

Aliasing is an insidious problem with compiled code. Aliasing itself is a straightforward concept – the same piece of memory can be pointed at by two different pointers. The performance problem occurs when the compiler assumes that the contents of memory can be changed by something other than the current code, and hence will continually re-fetch something from memory when you the programmer know that it’s not going to change.

How you handle aliasing problems depends on your compiler. You could simply turn off aliasing entirely. You might be able to use the “__restrict__” keyword selectively to tell the compiler that a pointer cannot be aliased. Both of these solutions have the characteristics of a mature optimization, in that they are best done early in the project if they are able to have an effect safely. It is also a mature optimization in that it often takes a mature programmer to spot it as a problem. Detecting aliasing as an issue can require that you look at the disassembly to see exactly what instructions are being generated.

AVOID PER-FRAME CODE

Not everything needs to be done 60 times a second. A huge saver of CPU time is to make certain aspects of the game logic run at a frame rate slower than the screen refresh rate. Most physics will look quite reasonable running at 30 frames per second. Some types of logic, such as path-finding, can be spread over a large number of frames.

This is a good example of a mature optimization. Many games use this form of optimization. It can be a massive time saver, cutting as much as 20% to 50% off your CPU time. It is something that is very difficult to add late in the project. The change in timing from running logic synchronized with the rendering frame advance, to logic running independently can introduce all kinds of obscure timing bugs.

AVOID ITERATIONS

Doing any single thing on a modern CPU takes no time at all. Where we fall over is when we need to do that single thing multiple times. Whenever you are iterating over some list of things in your game, consider how you might be iterating over fewer things, or not iterating at all.

For example, let us say we want to automatically target enemies that are within range. The simple way to do this is to iterate over all objects in the game, see which can be targeted, and of those see which are within range.

The problem here is that the list of all object might only contain a few that are actually targetable. You could optimize this by maintaing a separate list of all the objects that can be targeted. When targetable objects are created, and destroyed, they are added to, and removed from, this list. Then when you want to target something, you just iterate over the shorter list.

PROFILE INLINE FUNCTIONS

When profiling your code at a function level, most profilers will not be able to give you any kind of reading for inline functions. This is because the optimizing compiler will interleave the inline functions instructions with the instructions in the calling function, making it impossible to tell when the inline function is being executed.

While this is expected behavior, it can often hide problems with weighty inline functions that are either too slow, or called too often. The culprit will often show up as some large higher level function, which calls several inline functions.

To get a clearer picture of exactly what is going on here, you can turn off the expansion of inline functions just for the compilation unit that contains the high level function. This should reveal the inline functions to the profiler. Recognize that it is not an entirely accurate picture of what is going on, but it will give you an indication of the balance of CPU usage between the high level function, and the inline functions, and hence indicate on what you need to focus your optimization efforts.

PROCESS OFFLINE

Load times for games need to be short. Yet too often we get to the end of a project and find the load times are over a minute. Some advance optimization here can save you a lot of trouble later.

The most obvious optimization is to load your assets all as one big file (a ZIP, PAK or WAD file), rather than a large number of individual files. This is a very mature optimization – all developers should be using this technique, especially when developing for consoles- and yet it is surprising just how often the problem comes up. Developers often start off by loading individual files, as at the start of development, there are simply not many files to load, and the load times are bearable.

Towards the end of a project, there is a rapid increase in the number of assets in the game, and load times soar. Patching a quick load system at this point can be very difficult, especially with limited memory available on a console to act as a buffer. You will be forced to do some kind of optimization at the end of a project. Do it early. Do it maturely.

13 Comments »

Wise advices. Makes me wonder if using these wise advices is still considered as cowboy coding… :)

I’m more a software architect than a coder (but I have a fair coding experience as well), so my vision of optimizations is also geared toward software design.

The very basic thing is: Know you Language. And you’d better know it very well. The reason behind that is that you can spot the performance issues in your programming language and design your software in order to avoid them. For instance, you mentioned RTTI (which is a big hit in C++ (and probably in many other languages)). It’s not enough to replace the RTTI system by a faster, hand-crafted one – it will still be slower than a version which avoid using such kind of system. In my experience, design and algorithm are the too big keys to avoid the major performance hits – but that doesn’t mean that the code itself doesn’t mean to be tweaked.

As of today, premature optimization is no more a synonym for early optimization. Early optimizations is something that we can now achieve quite easily using techniques such as TDD to prevent the writting of non-functional code. I would restrict the idiom “premature optimization” to those optimizations that are done without any kind of pre-analysis – most of the case, the “optimized” code don’t have to be optimized, so that’s a complete waste of time.

On a side note, concerning offline processing and load time, I point you to Jan Wassenberg’s thesis: http://www.stud.uni-karlsruhe.de/~urkt/articles/study_thesis.pdf
It’s a very interesting read. Unfortunately, I’m not sure that this analysis can be applied to console programming (but you should know, don’t you?)

Best regards,

.E

Comment by Emmanuel Deloget — February 27, 2007 @ 11:46 am
For what it’s worth, with D, a dynamic cast comes down to comparing the address of the typeinfo; i.e. one cmp. (btw, it’s *Run-Time* Type Inspection). I actually (almost) screamed out loud when I got to that bit because it was such a ridiculously expensive way to do what ought to be a simple cast.
Apart from that, good article. Many thanks! :)

Comment by FeepingCreature — October 23, 2007 @ 2:46 am
The nature of a typeinfo comparison depends on the compiler. GCC uses pointer comparison on most targets, so dynamic_cast is very fast. However, this doesn’t always work with shared libraries that are loaded as plugins. For example, Boost.Python deliberately uses a custom typeinfo system based on string comparison because Python uses RTLD_LOCAL with dlopen/dlsym.

Comment by Jonathan — October 23, 2007 @ 6:52 am
[…] por Miguel Galves ligado Outubro 23rd, 2007 Aconselho a leitura do artigo Mature Optimization, escrito por Mick […]

Pingback by OtimizaÃ§Ã£o “madura” « Log4Dev — October 23, 2007 @ 7:05 am
Game programmers are in a different industry than other programmers. The typical game development life-cycle is a few years of high impact furious coding followed by intense QA and then release. The product sells well and all of the developers get a fat royalty check and retire. It’s a commercial success but as a software project it’s done. Birth to death in just a few years.

The rest of the programmers in the industry work on projects that, if successful, have lifespans of decades. They have different values than game development projects. Wise investments today pay off dividends years, if not decades from now. Investing in coding clarity, bug tracking, revision control, unit testing, portability, regression testing, security hardening, etc. are all investments that appreciate in value with the progression of the project. Optimization, on the other hand, depreciates rapidly because of Moore’s law: every dollar of programmer time spent today optimizing code is worth half as much in 18 months, and incurs even higher recurring costs in the long-term because everyone else who touches it has to scratch their head and figure out what’s going on.

In non-game programming projects, really fast code can give you a competitive oomph! in the marketplace, but this needs to be carefully balanced against the risks. Will all of this hand-tuned C code matter in 10 years when the processor it runs on will be 50+ times faster?

Comment by Michael B — October 23, 2007 @ 8:43 am
True, game programming is much more focussed on the NOW of optimization – the impact you make with the game reviewers and initial purchasers is very important. This article is really aimed at game programmers.

But I think there is still some applicability to “other” applications. While it’s true computers will keep getting faster, you still have to make an impression with your product in the CURRENT market. If you don’t sell enough now, it’s not going to be helpful that your product runs lickity-split in three years.

The broader point here is that some optimizations are essentially either architecture optimizations, or are low-level, yet pervasive, and hence should be done early in the project. A stitch in time saves nine.

“Premature optimization is evil” is a tautology – if it’s premature, then you are doing it too early. The skill is to know what optimizations (and just general efficiency consideration) are worth doing.

Comment by Mick West — October 23, 2007 @ 8:53 am
I think that this article provides good examples of premature optimisations. It fails to address what IMO is the major thing to note about performance issues: detecting potential performance issues up front is important and useful, but solving them ad-hoc up front is often counter-productive. That, to me, is the meaning of “premature optimisation”.

Take for example the “avoid iterations” tip. Avoiding unnecesary iterations is indeed a good idea. The example, though, shows why that kind of optimisation may not be good. The solution of using a list sounds like a good one, but what effect does it have on the rest of the system, when it comes to managing the objects, and is it really necessary, when other parts of the system might include space subdivision or proximity maps which would allow fast retrieval of objects? You might end up with a list solution, which not only took longer to code and was harder to debug, but made some parts of the system know about other parts they didn’t need to know about before, and eventually got thrown out because it’s still a slow and naive solution (still requires iteration to find nearby objects).

The point is, optimisation needs to be done with a global view. Potential performance problems can either be abstracted out — it’s pretty easy to create a “GetTargetableObjects” function that will be easy to change, as long as you find the right place for it — or have a generic solution implemented, one that’d be useful for all the various subsystems, with view of their requirements.

Another example of premature optimisation is memory pools. Sure, they will be faster than allocations and deallocations, but it’s also harder to find memory leaks when using them, since nothing is really released. Abstraction may help here, too, allowing allocations and pooling to be switched transparently.

So know and consider that iterations are slow, that RTTI may (in some cases) be an issue, and that allocations are slow. Then think of a way to write your code so that you could start with them and implement them out later. This way you’ll get a first playable (though slow) version more quickly, be able to squash high level bugs more easily, and when the time comes that the design of the game changes completely (as a result of that playable version) or your engine is switched, at least you didn’t write tons of ad-hoc code that’s no longer useful, or, worse, that you feel you must keep because you’ve worked on it so long, even though doing a code redesign to match the new game design would have been good.

Comment by ET — October 25, 2007 @ 5:43 am
The examples might not be the best, but I’ll stand by my point – that some optimizations are best done early, since they have such a large impact on the system that they are difficult to do later, and might be impossible to implement safely under a tight deadline.

The examples were just from my personal experience, and were mostly based on programing on the Sony Playstation 2. All were things that gave a good performance boost, and that I either did early, of wish I had.

Calling memory pools a premature optimization because they hide memory leaks seems a bit off. You can (and should) add debugging code to find leaks in whatever allocator you choose. It’s just as possible to introduce a memory leak bug at the end of a project as it is at the start.

I do agree with you though about the abstraction of reasonably high-level functions like “GetTargetableObjects”. Appropriate abstraction can help isolate potential bottlenecks. You do have to exercise caution, obviously, that abstracting excessively does not in itself introduce performance problems.

On your last point, you are basically talking about rapid prototyping. I’ve no problem with that, but calling something a “mature optimization” does not necessarily mean it has to be in there from day 1. Rather it means you don’t leave it too late. “Too late” can vary, based on the nature of the optimization, your code, and the project.

Comment by Mick West — October 25, 2007 @ 7:33 am
[…] Mature Optimization — Ð¾Ð¿Ñ‚Ð¸Ð¼Ð¸Ð·Ð°Ñ†Ð¸Ñ, ÐºÐ¾Ñ‚Ð¾Ñ€Ð°Ñ Ð½Ðµ “premature” […]

Pingback by weekly linkdump - max - Ð±Ð»Ð¾Ð³ Ñ€Ð°Ð·Ñ€Ð°Ð±Ð¾Ñ‚Ñ‡Ð¸ÐºÐ¾Ð² — October 25, 2007 @ 10:03 pm
Wow. This article… it’s like you’re reading my mind.

Each topic, reading the first sentence, I was thinking, “but what about…”, or “Yeah, like situation x…”, and the next sentence addrssed _exactly_ that.

Two examples, the section on allocation, I was thinking of particles, and a custom allocator, next sentence, about particles, and a custom allocator.

And about iterating over the list of objects, I was thinking “targetable objects, and a target list…”, next sentence: targetable objects and a target list.

It’s a good writer who can write in such a way!

Comment by SteveC — March 9, 2008 @ 8:06 pm
I agree with most of the points except for a couple…

“Avoid Allocations” – C++ allows you to override the new/delete operators at the global or class level. As long as joe-programmer uses new/delete instead of malloc/free, he can later use that tuned pool allocator.

“AVOID ITERATIONS” – this seems like a local operation if the abstractions are configured correctly. If the system uses an subject/observer for whenever an object is added/removed, then the list of specific objects can be updated locally instead of globally.

“PROFILE INLINE FUNCTIONS” – This is absolutely supposed to be a late optimization. First get the algo correct and unit-tested, then profile and optimize. I guess the best approach is to build and then optimize, build optimize, etc…

“AVOID ALIASING PROBLEMS” – this is only matters inside inner loops so is a local optimization…

Comment by Ben Schleimer — September 29, 2008 @ 10:08 am
Very good article overall.

The point I’d like to make is that the whole “PROFILE INLINE FUNCTIONS” thing is a bit back to front. The profiling should be done before the inline keyword is ever used. The ability to profile the function when it’s not inlined is just one of the good reasons not to inline until near the end, or at least not aggresively inline until then.

Comment by Malcolm — May 8, 2009 @ 6:46 pm
True, you can leave inlining until the end, however the performance implications of not inlining can be substantial. There’s no great reason to delay it, especially as it does not change the actual logic or data in the application. Since inlining can be a significant performance boost, I think it’s generally important to do it early. Just understand what it going on.

Comment by Mick West — May 8, 2009 @ 11:09 pm