Cowboy Programming Game Development and General Hacking by the Old West

March 23, 2008

Debugging Heisenbugs

Filed under: Game Development,Inner Product — Mick West @ 4:29 pm

(This article originally appeared in Game Developer Magazine, October 2007, in a slightly different format)

A Heisenbug is a type of bug that disappears or alters its behavior when you attempt to debug it. The word “Heisenbug” is a slight misnomer, referencing Heisenberg’s uncertainty principle, which describes how, in quantum physics, it is impossible to know both where something is, and how fast it is. A related phenomenon is the “observer effect”, which says you cannot observe something without altering it – this “observer effect” is what causes the problems we call Heisenbugs.

Heisenbugs are common in game development, most frequently in lower level code. A programmer may encounter several such bugs in the course of development, and a failure to appropriately handle them can seriously derail development, as it may take many days to track down the elusive bug. This article discusses some of the causes of Heisenbugs, and gives some guidelines for avoiding them and tracking them down.

RANDOM CAUSES

The causes of Heisenbugs are as varied as the causes of regular bugs. But some factors are more likely to result in a Heisenbug. Typically those bugs are highly depended on what are essentially random factors which are outside the programmer’s control.

The most literal example of this would be a bug that is caused by the generation of random numbers. Perhaps a table overflow bug might only occur when two particular random numbers are generated in sequence. Random number generation is really not random, you are usually just generating deterministic, but random looking numbers in sequence. But because the amount of numbers generated can be affected by the game state, which is in turn affected by the user input, then these pseudo-random number quickly become unpredictable. To remove this possibility, try making the random number generator return the same number, and see if the bug still occurs.

Other essentially random factors could be the addresses of dangling pointers, the order of data processing in multi-threaded algorithms, the contents of an unflushed cache that is underwritten by DMA, the contents of uninitialized memory (see later), the assumed state of a GPU register, user input (especially analog), read and write times for persistent storage, the persistence of values in improperly synchronized memory (volatile variables). The key diagnostic technique here is to try to eliminate all sources of randomness or indeterminism.

UNINITIALIZED MEMORY

Often when memory is allocated, or variables are instantiated, they are not set to any particular value. Generally this is not a problem, as the code that uses that memory should initialize it to some meaningful value. However, badly designed code, or code that is extended without fully understanding the full implications of the extension can introduce code pathways which result in memory being used before it is being initialized. This will result in a Heisenbug if the uninitialized value is generally the same value, but under certain circumstances the value changes because of changes in the flow of unrelated logic.

That’s a fundamental problem with Heisenbugs, they often appear to be related so some kind of game function that is in fact basically unrelated – (Example: “The game glitches when I open a box”). This can result in a wild goose chase, where you focus your efforts on what seems to be the cause of the bug (code related to opening boxes), and the real problem is in something entirely unrelated.

This can cause problems with assigning bugs to the correct programmers – if a bug is assigned to the game object programmer, simply because the glitch happens when boxes are opened, then you may have a programmer fruitlessly spending several days trying to track down a bug that is nothing to do with them. This can be highly problematic if the assigned programmer is a junior programmer, and unfamiliar with such problems. For this reason it is important that such imprecise bugs be evaluated by a more experienced programmer, and the junior programmer is able to ask for help if their hunt for the bug leads them out of their domain.

Uninitialized memory Heisenbugs can be tracked down by initializing memory to a known value, but one that is more likely to cause a problem than zeroing the memory, such as 0x55555555. Uninitialized variables can be nipped in the bud by having your compiler not allow them. This may be a language default, such as in C#, or a warning, such as in C++. If it is an available compiler warning, then it is highly advisable to make this be an error, so the code will not compile with this warning. While this may require a few minor annoying code changes to get around the warnings, it is generally preferable to the problem of last minute debugging of a Heisenbug, lost in a stream of compiler warnings.

MEMORY CORRUPTION

One of the hardest types of Heisenbug to track down is random memory corruption. In this bug, with random frequency, at a random point in time, a random location in memory has a random value written to it. The less randomness involved; the better for the debugger. If it happens at a particular time, you can try to determine what exactly is going on at that time. If it’s at a particular location, you can trap the write, or look into what code or data has pointers to that location. If the value written is always the same, then sometimes that holds a clue. If it’s always 0x3fe80000, then that’s 1.0f in floating point, so ask what might be storing a 1.0 in memory.

If it’s totally random (but reasonably frequent) that’s actually fine too, as writing to random locations can usually be caught in the debugger, as it will eventually write to an illegal location, and you can set a write access breakpoint on read-only data.

The worst problem comes when the memory being corrupted is randomly within a narrower range of memory that is constantly being written to by legal processes, such as the stack (used for local variable), or a dynamic heap, where memory locations are constantly being used and reused. In this situation, unless you can narrow down the precise point in time the bug occurs, you will be unable to observe the corruption happening, or set a breakpoint, as all the other writes in that memory area will obscure the moment of corruption.

If it’s difficult to see what is being corrupted, and how much, and if you can see the corrupt values after the fact, then again you can try to characterize the corruption from the nature of the data. If a block of three or four words is corrupted, perhaps with values that start (in hex) with 3, then are followed by a bunch of very random digits, then that might be a clue. See figure 1a

Figure 1a – a hex dump of some ASCII data (file names) with some corruption on the second line. The numbers look like they might be floats.

5c6b6369 73636f64 6d61675c 6e697365  ick\docs\gamesin
3e6fdb1a bd0ee1b0 3f7909cd 6f635c6b  .Ûo> °Ã¡. ½Ã.y?k\co
655c6564 706d6178 5c73656c 6d617865  de\examples\exam

Figure 1b – the same data, but viewed in float mode. The numbers that are actually sensible floats are quite obvious.

2.6502369e+017 1.8019267e+031  4.3599426e+027  1.8062378e+028
0.23423424    -0.034883201     0.97280580      7.0364824e+028
6.5049435e+022 2.9386312e+029  2.7403974e+017  4.3612297e+027

Here the corruption is not immediately apparent in the hex view. But looking at the ASCII data, you can see where things are going wrong. Then looking back at the hex, we see the first three words on the second line are actually very different, they look like they might be floating point values (two of them start with 3), so we switch to floating point view (figure 1b) and we see that yes, they are very sensible floats (most floats in games are small, usually less than one). Looking closer we can see they actually form a unit vector.

So these are all clues. They don’t tell you where the corruption is coming from, but they do tell you a little about it. In this case, something is writing a solitary unit vector to memory, and not corrupting the memory on either side. Perhaps you already have some suspects, and this might help whittle them down. Or perhaps this is your first clue, in which case it is a valuable first step, and can help you mostly eliminate many other things from consideration (all the code that could not be writing unit vectors).

TRACKING THE UNTRACKABLE

But how do you find something that vanishes when you look at it? A Heisenbug in a game will come up with a certain frequency. The more frequently it occurs, the easier it is to track down. Even a bug that occurs as infrequently as once a week can eventually be tracked down (although hopefully you would have a few weeks left on the project).

If a bug cannot be isolated by normal means, then you must look at circumstantial evidence. What is happening when the bug occurs? What just happened? What was going to happen? Perhaps the bug occurs only on a particular level, or in a particular area of the game. Try to build up a characterization of the bug, no matter how vague.

Enlist the help of the testers here. They play the game in ways very different from the way programmer play the game. A good tester will try to make a bug happen more often, and will often come up with convoluted theories as to what sequence of events they think precipitates the bug. These theories are often wildly off the mark, and contain many red herrings, but they also can contain many valuable clues. If a tester can reproduce a bug in a reasonably period of time, even an hour or so, then it is often worth watching the tester do this, as the programmer could quite easily waste several hours or days in fruitless code speculation, when observing some gameplay might provide a clue.

The classic definition of a Heisenbug is one that goes away when you look at it. This is generally not strictly the case. While it is true that you often get bugs that only occur while playing the game, and not when you hook up the debugger, or when you recompile in debug more, you can always make some changes to the situation that will tell you more about the nature and location of the bug.

FIXING BY NOT FIXING

Characterizing the bug by describing the gameplay situations under which it occurs (or is more or less frequent) is half the story. The other story is what modification you can make to the code, and how the affect the bug.

If you’ve gone through the usually debugging methods, and failed in isolating this elusive bug, then you need to focus on narrowing it down. Now a Heisenbug is different from a regular bug. Heisenbugs are sensitive to state changes in the total state of the program. If you remove some code, and that prevents the bug from happening, it generally tells you nothing definite about the bug – you’ve quite possibly simply modified the state so the bug is either removed or hidden. You can’t tell either way. For example, if you suspect synchronization issues, and you turn off multi-threading, and the bug goes away, this unfortunately does not mean that you have isolated the cause of the bug. It’s a clue, but turning off multi-thread so greatly alters the state of the system in so many ways, you could simply have hidden the bug.

On the other hand, if you remove some code and the Heisenbug still happens, then paradoxically this could be much more useful. You have eliminated some code that is nothing to do with the bug, meaning you don’t need to consider that code any more, and your field of possible culprits shrinks. If you turn off multi-threading, and the bug still happens, that means you can be 99% sure it’s nothing to do with multi-threading, and you can move on with confidence, having eliminated a huge range of possible causes.

As well as narrowing down the bug in this way, you can try to clarify its location (and speed your tracking) by trying to make it happen more often. You have to get quite creative here, focus on amplifying the bug. If it seems to happen when more instances of a certain object are in the level, then modify the level so there are hundreds more of those objects. Make bold sweeping moves here, if it often happens when explosion are triggered, then trigger thousands of random explosions. If it happens when running fast, then double the running speed. Stress test the game until the bug either become repeatable, or its nature is revealed.

MAGICAL THINKING

Mental discipline is important when tracking Heisenbugs. Their very nature makes it very difficult to discern anything concrete about them and so even quite wild theories can start to take root in your mind. Perhaps, you might think, your computer or dev-kit is malfunctioning? Perhaps there are glitches in the power supply? Perhaps that flickering light is causing EMF resonance in the CPU? Perhaps vibration from passing trucks is jigging a loose component in the motherboard? Perhaps there is a bug in the compiler?

This is magical thinking – it is tempting to ascribe some esoteric cause which absolves you from guilt, but it’s rarely true. Much time can be wasted by entertaining these remote possibilities, especially with bugs that are highly intermittent. It is import to dispense with these ideas at once. If you suspect your computer, then change it. If you think there are problem with the power supply, then install a UPS or move to a different circuit in another room. Perhaps it was a cosmic ray, but it’s vastly more likely there is something wrong with the code.

It’s also tempting to blame the compiler. Compiler bugs do exist, but they are very rare. For all the bugs where the programmer has said “that can’t possibly be a code bug, it MUST be the compiler”, in 95% of cases, in my experience, the problem has turned out to an ordinary bug, and not a compiler problem. If it IS a compiler problem, then that may require the assistance of someone familiar with the very low level debugging required during the final stages of tracking this down.

Heisenbugs are mentally difficult for programmers to deal with. It is very frustrating to have something that eludes clear methodical debugging, and where you are forced into speculation, experiments and even debugging based on vague statistics. But a single Heisenbug can derail a project, especially if it is not addressed as soon as possible. Some Heisenbugs crop up only when the system is stressed, which might not be until just before beta, when all the assets and systems are fully incorporated. Programmers should be familiar with the possible causes, and general debugging techniques for dealing with Heisenbugs.

RESOURCES

Why Programs Fail: A Guide to Systematic Debugging, Ch 4, by Andreas Zeller, Morgan Kaufmann Publishers, 2006

Cross Platform Game Programming, Ch 6, by Steven Goodwin, Charles River Media.

Debugging Concurrency, Philippe Paquet, June 2005, Gamasutra, http://www.gamasutra.com/features/20050606/paquet_01.shtml

4 Comments »

  1. You forgot one important cause of heisenbugs: assert()s that _do_ something.

    Comment by b100dian — March 24, 2008 @ 6:27 am

  2. Another couple of good things you can do:
    Plug in static analysis tools, often times many such bugs can be caught at compile time, although not always. However, you can also plug in data-flow analysis tools if you have the budget for them, those can almost always catch those suspect areas, either before they happen or even during the act.

    Comment by Washu — March 29, 2008 @ 12:15 pm

  3. […] Once I debugged a program that read 1-2 characters past the end of buffer when some specific input was provided. The bytes after the buffer were usually readable, but if they were not, the program crashed. It was hard to catch the bug, because it occurred rarely (being similar to what Mick West calls Heisenbugs). […]

    Pingback by smallcode » Blog Archive » Debugging buffer overflows — April 14, 2008 @ 4:58 am

  4. Uh how I hate debugging. Can’t even think this kind of situation when your bug is chanching when trying to catch it.

    Comment by lehti — November 11, 2009 @ 12:28 pm

RSS feed for comments on this post. TrackBack URL

Leave a comment

You must be logged in to post a comment.

Powered by WordPress