Cowboy Programming Game Development and General Hacking by the Old West

September 9, 2008

Debugging Memory Corruption in Game Development

Filed under: Game Development — Mick West @ 7:42 am

Definition: Memory corruption is an unexpected change in the contents of a memory location.

The symptoms of memory corruption can range from hard crashes, all the way through minor glitches, to no symptoms at all. The causes of memory corruption are many and varied, and include memory corruption itself. In this article I attempt to classify the various ways in which memory corruption can manifest itself, the various causes, and some ideas for identifying the root causes of various types of memory corruption. I’ll cover:

  • Symptoms of Memory Corruption
  • Investigating Corruption
  • Identifying Hex Droppings
  • Causes and Effects of Corruption  

 

Symptoms of Memory Corruption

Given that memory corruption can manifest in almost any way, it seems redundant to list all the symptoms. However, different symptoms of memory corruption are indicative of different causes of corruption. Sometimes we can also gather valuable clues from the type of symptom, which might lead us closer to the cause of the corruption.

 

Crashes

Crash bugs come in all flavors, but memory corruption can cause just about all of them. The way in which the game crashes can give you valuable clues as to what type of memory corruption is occurring. These clues can indicate where you need to start looking for the cause of the crash.

Address Error

An address error indicates that a pointer has been modified to point to an illegal address. This could be an address that is: not word aligned, NULL, or an address that points to unmapped or protected memory.

Address errors are quite helpful, since program execution stops when an address error is encountered, and it is quite easy to enter the debugger, and determine the address and corrupted contents of the pointer variable that is being used.

Infinite Loop

Corruption of data can make a loop fail to terminate. Take for example code that traverses a linked list. Memory may be corrupted in such a way that the list contains a loop. Since the code expects the list to terminate with a NULL value at some point, it simply carries on around the loop forever.

This behavior is a lot more likely with a list that uses indexing instead of pointers, but it is still possible with pointers. Consider the implication of memory being corrupted in such a way that a pointer in the list now makes the list circular. It is very unlikely that some random corruption, or some unrelated code, would happen to stick a semi-valid pointer in the right place. Hence, it is more likely that the corruption came from something in the list code itself.
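
One way to turn this symptom into an immediate, debuggable failure is to validate the list before walking it. The following is a minimal sketch, assuming a singly linked list; Node, HasCycle and CountNodes are hypothetical names, and the check used is Floyd's tortoise-and-hare algorithm rather than anything specific to a particular engine.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical list node; a real game list would carry a payload too.
struct Node {
    Node* next = nullptr;
};

// Floyd's tortoise-and-hare check: returns true if following `next`
// from `head` never reaches nullptr because the list loops back on itself.
bool HasCycle(const Node* head) {
    const Node* slow = head;
    const Node* fast = head;
    while (fast != nullptr && fast->next != nullptr) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast) {
            return true;
        }
    }
    return false;
}

// Debug-only traversal guard: refuses to walk a circular list.
std::size_t CountNodes(const Node* head) {
    assert(!HasCycle(head) && "list is circular: possible memory corruption");
    std::size_t count = 0;
    for (const Node* n = head; n != nullptr; n = n->next) {
        ++count;
    }
    return count;
}
```

Calling a check like this at the top of the list's own update code stops the game at the first traversal after the corruption, instead of hanging in the loop.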

 

Illegal Instruction

An illegal instruction could mean one of several kinds of memory corruption.

 

Stack Corruption – If the stack has been corrupted in some way, this can lead to an incorrect return address, which ends up pointing to illegal code. This is the most common way buffer overruns are exploited by hackers.

 

Jump Table Corruption – If a v-table (or any kind of table of jump-addresses) is corrupted, then the PC can end up pointing at an illegal instruction.

 

Code Corruption – The code itself can be corrupted by a bad pointer corrupting sections of the code. This type of corruption can be very hard to detect if the code that has been corrupted is not executed very often.

 

Stack Overwriting Code – A special kind of code corruption. Runaway recursion can sometimes run unchecked until the stack overwrites the routine that it is executing. This shows up nicely in the debugger in a hex window.

 

Function Pointers – Since function pointers are sometimes stored in data structures, and passed around like regular variables, then they can be corrupted just like any other variable. This can eventually lead to the program executing incorrect code.

Unexpected Values

If you have a variable of some kind that normally has a value in a certain range, and you unexpectedly find that the variable contains some ridiculous value, then this may be due to memory corruption.

Wildly unusual values often have noticeable effects, such as the player teleporting to the end of the universe, or a model being scaled infinitely large.

Less severe corruptions can occur, for example, a counter might simply be reset to zero or even just changed slightly. This type of corruption can be difficult to track down, as it may not produce especially noticeable effects.

Here a good testing department is invaluable. If the testers can notice little inconsistencies like this, then you will catch potentially harmful bugs at a much earlier stage.

Since the location of the corruption of memory is often somewhat random, the problem may go undetected for some time. This may give the false impression that the existing code is solid. Upon adding new code or data, the bug may reveal itself, causing you to think that the new code has caused the bug, when in fact the new code has only caused memory to be slightly re-ordered into a configuration that reveals a pre-existing bug.

 

Glitches in the Graphics

Since memory often contains graphical data, if memory is being corrupted, it may show up as some corruption in graphics. The way this manifests will depend on the nature of the graphics, and the nature of the corruption.

Textures

Changes in color of a single pixel, or a very short row or column of pixels, indicates that a pointer to a variable has acquired a wrong value, perhaps the result of earlier memory corruption.

Changes to large swathes of a texture indicate either an incorrect pointer, or some kind of buffer overrun. Corruption that looks like a regular pattern, often containing vertical or diagonal stripes, indicates some kind of array exceeding its bounds, or one that is now at an incorrect address due to a corrupt pointer.

Corruption in a texture that resembles a squashed or discolored version of another texture indicates that you might be overwriting the texture with another one of different dimensions or different bit depth.

If the corruption is static (unchanging), then it indicates a one-time event, where a pointer was misused just once. The corruption happened, and the game went along on its way. In this case, you need to try to track down what triggered that event. Testers need to try to find a way of duplicating the circumstances that led to the visual corruption. Video of the game is very useful in this case.

If the corruption appears to be animating, if the corrupt section is flickering, or the banded area is flashing on and off, then you have some ongoing corruption. If the game remains in this state, it should make it easier to debug.

 

Meshes

Corrupt meshes usually result in some vertices being displaced a considerable distance from the model. If the corruption region is small (a word or so), then you may just see one vertex displaced; this will appear as a thin triangle or line that extends off screen and swings wildly about as the model animates.

Corruption of a large amount of the model’s mesh can result in the model “exploding”, covering the entire screen with random looking triangles that flicker and swing around.

 

Skeletons and Animation

Corruptions of the underlying skeleton data, or associated animation, can result in the model still looking somewhat recognizable, but with the various body parts being displaced to unusual locations. Corrupt animation will result in body parts flickering and jumping around wildly. The exact manifestation of the symptoms of corruption depends upon the method used to store the animations.

 

Investigating Corruption

If you suspect that memory corruption is occurring, then your first step is to try to determine if this is actually some form of corruption, and what type it is.

 

Is it actually corruption?

 

Just because a value in memory looks rather unusual, does not mean that it was not generated by the code that owns that memory. The unusual value might simply be the result of an error in your logic. It could have been quite legally copied from somewhere else. It could be the result of computations involving incorrect data, perhaps data that was already corrupt.

To determine this, you need to determine if the code that you might think is writing to that location actually is writing to that location, and see what values it is writing. Ideally, you would add assertions at all locations that you think might legally be writing to that location, and check the range of values that are being written (make sure the “corrupt” value is outside this range).
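
As a minimal sketch of that idea, assume a hypothetical Player structure whose m_health field should always lie between 0 and 100. Player, SetHealth, IsLegalHealth and the range itself are all assumptions made for illustration. Every legal write is routed through one function that asserts the range.

```cpp
#include <cassert>

// Hypothetical player state; the valid range of m_health is an assumption.
struct Player {
    int m_health = 100;
};

constexpr int kMinHealth = 0;
constexpr int kMaxHealth = 100;

// True if `value` is one that legal code could have written.
constexpr bool IsLegalHealth(int value) {
    return value >= kMinHealth && value <= kMaxHealth;
}

// Route every legal write through one function so the range can be checked.
void SetHealth(Player& player, int value) {
    assert(IsLegalHealth(value) && "legal writer produced an out-of-range value");
    player.m_health = value;
}
```

If the assertion never fires, yet the value found in memory fails IsLegalHealth, then the legal writers can be ruled out and the value is more likely to be genuine corruption.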

 

Who owns that memory location?

Memory corruption usually occurs when some piece of code is using an area of memory that it should not. The corrupt memory then causes problems in some other code.

There are two primary ways in which this can happen: corrupting a legal area, and using an area illegally.

Consider a piece of code A, that uses an area of memory A(m). If another piece of code, B, also happens to have a pointer to A(m), and writes some data to that, then code B is corrupting memory A(m). This is the normal form of memory corruption.

Now consider if the code A is legally using memory location A(m). Code B is illegally also using some location within (or overlapping) A(m). Code B appears to work correctly, but then code A makes a legal update to A(m), causing code B to manifest a bug. It appears that code A is corrupting memory B(m). However, the fault here is with code B. It has the appearance of corruption, yet may mislead you to thinking that the problem is with code A.

It is important to determine who actually owns the memory location that is being corrupted. Is the “legal” use actually legal? Can you demonstrate that code B actually owns those memory locations? If you can quickly determine that code B does not actually own that memory, then the tracking down of code A is irrelevant, which can save you substantial time.

 

Repeatable, Fixed Location

If the corruption is consistent, meaning it happens in the same location and under the same conditions then you are (relatively speaking) in luck. Debugging in this case is a matter of somehow watching that location, and tracking down the cause of the corruption. Since the corruption happens under the same conditions, you should be able either to trap it immediately, or quickly narrow down the possibilities.

Intermittent, Fixed Location

If the corruption happens in the same memory location, yet is intermittent, then this makes tracking down the corruption more difficult. Since you do not know when the corruption occurs, you cannot be as focused in your search, and must rely more on general observation as to the nature of the corruption when tracking down the cause.

Intermittent, Variable Location

If the corruption happens in varying locations, and at unpredictable times, then your debugging options are often limited to making observations about the corruption after it has occurred.

 

Determine the location of the corruption

If the memory corruption is the immediate cause of the bug, such as with an address error due to a corrupt pointer, then you may be able to immediately determine the effect of the corruption simply by seeing what address was being accessed at the time of the bug.

If the memory corruption is an intermediate cause, then you will track down the address of the corruption in the process of analyzing the immediate cause of the bug, and any intermediate causes that lie between the root cause and the symptoms.

Hardware Breakpoints

If your target platform has some kind of break-on-access breakpoint, then use this as your first line of investigation when debugging memory corruption with a known address. Simply set the debugger to execute a breakpoint when a memory location changes, then when the change happens, see what code is executing.

This technique can work very well if the location that is being corrupted contains data that is relatively static. However, if the location contains some dynamic variable that changes hundreds of times per frame, then you may have some difficulty in finding the single write access that is causing the problem.

In that case, you may be able to augment your write-access breakpoint with a conditional check that verifies that the data being written to the corrupt location is in the valid range.

Sometimes memory is corrupted with values that are within the valid range, but nonetheless are wrong. Your options here are more limited:

 

– Repeatedly run the code, and each time the breakpoint trips, look at the call stack until you see something that you do not recognize as code that can legally write to this location.

– If the legal places that write to this location are known and relatively limited, then update them to first write to some separate location. First ensure the corruption does not also affect that separate location, and then update the breakpoint condition to check the value written matches the stored value.

– Try tracing through the code, stepping over functions. When you find one that causes the corruption, dig into that function the next time around.
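
The second option above can be sketched roughly as follows. This is only an illustration: Timer, SetTicks, TicksUncorrupted and the global shadow variable are hypothetical names. The legal writers record each value in a separate shadow location, and a mismatch between the two means something other than a legal writer has changed the real one.

```cpp
#include <cassert>

// Hypothetical variable that is being corrupted with in-range values.
struct Timer {
    int m_ticks = 0;
};

// Shadow copy kept in a separate location, updated only by legal writers.
int g_shadowTicks = 0;

// Every legal write goes through here and updates the shadow first.
void SetTicks(Timer& timer, int value) {
    g_shadowTicks = value;
    timer.m_ticks = value;
}

// True while the real value still matches what legal code last wrote.
bool TicksUncorrupted(const Timer& timer) {
    return timer.m_ticks == g_shadowTicks;
}

// Debug-only read that stops as soon as the two copies disagree.
int GetTicks(const Timer& timer) {
    assert(TicksUncorrupted(timer) && "m_ticks changed by a non-legal writer");
    return timer.m_ticks;
}
```

The same comparison can serve as the condition on a write-access breakpoint, if the debugger supports conditional expressions.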

If you don’t have this debugger functionality, then you can still roll your own by writing a little function that checks that memory location. You can then sprinkle calls to this function through your code, narrowing down the region of code that causes the error. If the memory location’s address is dependent on the code, you may need to compile the code, note the new address, and then re-compile with the new address wired into the code.
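
A roll-your-own check might look something like the sketch below. All of the names (WatchLocation, CheckWatch, CHECK_WATCH and the globals) are invented for this example, and it assumes the watched location holds a 32-bit value that should not change while it is being watched.

```cpp
#include <cassert>
#include <cstdint>

// Address being watched and the value it is expected to hold.
const volatile std::uint32_t* g_watchAddress = nullptr;
std::uint32_t g_expectedValue = 0;

// Last place in the code where the watched location was seen intact.
const char* g_lastGoodFile = "";
int g_lastGoodLine = 0;

// Start watching `address`, remembering its current contents.
void WatchLocation(const std::uint32_t* address) {
    assert(address != nullptr);
    g_watchAddress = address;
    g_expectedValue = *address;
}

// Returns true if the location is intact. On failure, the corruption
// happened between (g_lastGoodFile, g_lastGoodLine) and this call.
bool CheckWatch(const char* file, int line) {
    if (g_watchAddress == nullptr || *g_watchAddress == g_expectedValue) {
        g_lastGoodFile = file;
        g_lastGoodLine = line;
        return true;
    }
    return false;
}

// Sprinkle this through suspect code.
#define CHECK_WATCH() CheckWatch(__FILE__, __LINE__)
```

When a check fails, the region between the last good file and line and the failing call is the next place to add more checks.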

Another manual method is to keep track of allocations that include that particular location. Memory corruption is often due to a dangling pointer that was once legal. So if you know the location of corruption in advance, then having a list of the callstacks of all the allocations that once owned that location can help quickly identify the culprit.
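
A rough sketch of such an allocation history is shown here. It is an illustration only: AllocRecord, RecordAlloc, RecordFree and OwnersOf are hypothetical names, and a plain owner string stands in for a captured call stack, since capturing one is platform specific.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One record per allocation ever made; records are kept after the free.
struct AllocRecord {
    std::uintptr_t start;
    std::size_t size;
    std::string owner;  // stand-in for a captured call stack
    bool freed;
};

std::vector<AllocRecord> g_allocHistory;

// Call from the allocator whenever a block is handed out.
void RecordAlloc(const void* block, std::size_t size, const std::string& owner) {
    assert(block != nullptr);
    g_allocHistory.push_back(
        AllocRecord{reinterpret_cast<std::uintptr_t>(block), size, owner, false});
}

// Call from the allocator whenever a block is released.
void RecordFree(const void* block) {
    const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(block);
    for (auto it = g_allocHistory.rbegin(); it != g_allocHistory.rend(); ++it) {
        if (it->start == start && !it->freed) {
            it->freed = true;
            return;
        }
    }
}

// Every allocation, live or freed, whose range covered `address`.
std::vector<AllocRecord> OwnersOf(const void* address) {
    const std::uintptr_t a = reinterpret_cast<std::uintptr_t>(address);
    std::vector<AllocRecord> owners;
    for (const AllocRecord& r : g_allocHistory) {
        if (a >= r.start && a - r.start < r.size) {
            owners.push_back(r);
        }
    }
    return owners;
}
```

An owner that is marked as freed, yet whose code is still running, is a strong candidate for holding a dangling pointer.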

Identifying Hex Droppings

A memory location has been corrupted. Assuming you cannot quickly find what bit of code is responsible for the corruption, you can learn a lot about what that piece of code might be by examining the nature of the corruption.

Once you have identified the location that has been corrupted, then look at a hex dump of it in the debugger (or print out your own if a debugger is not available). A hex dump looks something like this.

 

0x00322B90  fd fd fd fd ab ab ab ab  ýýýý««««
0x00322B98  ab ab ab ab ee fe ee fe  ««««îþîþ
0x00322BA0  00 00 00 00 00 00 00 00  ........
0x00322BA8  12 00 0d 00 22 07 18 00  ...."...
0x00322BB0  48 2b 32 00 40 2c 32 00  H+2.@,2.

 

The memory address is on the left, then comes the contents of memory, here listed eight bytes to a row, and then those eight bytes are repeated as ASCII characters.
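
If no debugger is available, a dump in roughly this layout can be produced with a small function. The sketch below is one possible version; HexDump is a hypothetical name, and unlike the debugger output above it prints a dot for every byte outside the printable ASCII range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Formats `size` bytes starting at `data`, eight per row, as
// "0xADDRESS  hh hh hh hh hh hh hh hh  ascii". `displayAddress` is the
// address printed for the first byte.
std::string HexDump(const void* data, std::size_t size, std::uintptr_t displayAddress) {
    assert(data != nullptr || size == 0);
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    std::string out;
    char field[32];
    for (std::size_t row = 0; row < size; row += 8) {
        std::snprintf(field, sizeof(field), "0x%08lX ",
                      static_cast<unsigned long>(displayAddress + row));
        out += field;
        std::string ascii;
        for (std::size_t i = 0; i < 8; ++i) {
            if (row + i < size) {
                const unsigned char b = bytes[row + i];
                std::snprintf(field, sizeof(field), " %02x", static_cast<unsigned>(b));
                out += field;
                ascii += (b >= 0x20 && b < 0x7F) ? static_cast<char>(b) : '.';
            } else {
                out += "   ";  // pad a short final row
            }
        }
        out += "  " + ascii + "\n";
    }
    return out;
}
```

Passing the real address of the block as displayAddress keeps the output comparable with what the debugger's memory window would show.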

Single Bit Corruption

Few pieces of code will cause only a single bit to be flipped. The most likely candidate is a bit-field of flags.
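
To confirm that only one bit changed, and which one, you can XOR the value you expected with the value you found. The helper names below are made up for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Bits that differ between the expected value and the value found in memory.
std::uint32_t ChangedBits(std::uint32_t expected, std::uint32_t actual) {
    return expected ^ actual;
}

// True if exactly one bit differs, which points toward flag-setting code.
bool IsSingleBitChange(std::uint32_t expected, std::uint32_t actual) {
    const std::uint32_t diff = ChangedBits(expected, actual);
    return diff != 0 && (diff & (diff - 1)) == 0;
}

// Index (0 = least significant) of the flipped bit.
// Only meaningful when exactly one bit has changed.
int ChangedBitIndex(std::uint32_t expected, std::uint32_t actual) {
    assert(IsSingleBitChange(expected, actual));
    std::uint32_t diff = ChangedBits(expected, actual);
    int index = 0;
    while ((diff & 1u) == 0) {
        diff >>= 1;
        ++index;
    }
    return index;
}
```

Knowing the bit index lets you search the code for the flag constant or bit-field member that uses that bit.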

Single Byte Corruption

If only one byte was modified, then that can narrow down the fields considerably. If the corrupt value is 0 or 1, then perhaps it is a byte flag.

Single 32-bit Word Corruption

A 32 bit word is often the fastest and most convenient way of storing data. It is the only way for certain data types such as floats or pointers (depending on your platform). Looking at the contents of the 32 bits will tell you something about the code that inserted that value there.

If you know that a 32 bit value is being corrupted, then you should view the memory location as a single 32 bit word, rather than as a sequence of four bytes. This removes any confusion with endianness, and makes the type of data much easier to recognize.

That said, it is also a useful skill to be able to recognize certain types of data as a byte stream, since the data may be intermingled (in a class) with other data of varying types. In the examples below we give the values both as a 32-bit integer, and as a four byte little-endian format, which is harder to recognize than big-endian, since that is just the word with the bytes spread out.

Zero

Example:

00000000   or   00 00 00 00

Zero is easy to recognize. At first, you might not think there is much information in a zero, but consider the limited number of reasons a piece of code could be writing a single zero to a location in memory, and it may give some clue as to what piece of code might be responsible.

 
Zero is:

 

NULL – Perhaps the errant code is clearing a pointer? Some programmers make a habit of clearing any pointer that is a member variable after they have deleted whatever it was pointing to (a reasonable practice to help prevent dangling pointers). However they might be doing it at the wrong time.

 

Zero. Both as an integer (0) and as a floating point (0.0f). Where in the code are individual values set to zero?

 

FALSE. Perhaps the code is treating the location as a flag, and simply setting it to FALSE.

 

The first value in an enum – Perhaps a type field, or a status field. What kinds of enumerations do you have in your suspect code? What does the first entry mean? What causes the code to write out the first value?

Clear and empty – Often data structures are initialized to zero. Does this happen anywhere in the code? Does the size of the data being cleared match the zeros in the corruption?

One

Example

00000001 or 01 00 00 00

 

One is also easy to recognize. Less common than zero, it can still tell you something about the code that wrote it there.

 

One is

 

TRUE – Perhaps it is being used as a flag. What could be set to TRUE?

 

An integer – Hence it’s not a floating point number. You can discount code that stores floats.

 

Not a pointer – Odds are that the code causing the corruption is not thinking that it is storing a pointer, unless it is a secondary bug.

 

The first value of an enum – like any small number, it’s possible it is an enumerated value, possibly a type number.

 

Floating Point Numbers

 

Example

3F800000 or 00 00 80 3F

 
Many floating point numbers have an easily recognizable format. A very common floating point value is the one shown above, 3F800000 is the hex representation of the 32-bit floating point value of 1.0. See Table 24.X for additional values.

 

Table 24.X

Float			Int

0.00000000		00000000
0.50000000		3F000000
1.00000000		3F800000
-1.0000000		BF800000
2.00000000		40000000
100.000000		42C80000
0.33333334		3EAAAAAB
3.14159274		40490FDB

 

Notice how the small values start with a 3. A floating point number has the first bit being the sign, the next eight bits being the exponent, and the following 23 bits being the fractional part. Since numbers in the same range tend to have similar exponents, you can often recognize a group of floating point numbers of similar magnitude.

In games, a very common range for floating point values is from -1.0 to +1.0. These numbers are used extensively in unit vectors, transformation matrices, UV coordinates and scaling factors. Numbers in this range usually start with a 3 (for positive numbers), or a B (for negative numbers).
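
The bit layout described above can be checked with a few lines of code. This is a minimal sketch that assumes 32-bit IEEE 754 floats (the static_assert enforces that assumption); WordToFloat, FloatToWord, SplitFloatWord and LooksLikeSmallFloat are names invented here, and the last one is only the rough "starts with 3 or B" heuristic, not a precise range test.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
              "this sketch assumes 32-bit IEEE 754 floats");

// Reinterpret a 32-bit word from the memory window as a float.
float WordToFloat(std::uint32_t word) {
    float f;
    std::memcpy(&f, &word, sizeof(f));
    return f;
}

// The reverse: the 32-bit pattern that a float occupies in memory.
std::uint32_t FloatToWord(float f) {
    std::uint32_t word;
    std::memcpy(&word, &f, sizeof(word));
    return word;
}

// Sign bit, 8-bit exponent and 23-bit fraction of a float word.
struct FloatFields {
    std::uint32_t sign;
    std::uint32_t exponent;
    std::uint32_t fraction;
};

FloatFields SplitFloatWord(std::uint32_t word) {
    return FloatFields{word >> 31, (word >> 23) & 0xFFu, word & 0x7FFFFFu};
}

// Rough heuristic: the top hex digit is 3 (positive) or B (negative).
bool LooksLikeSmallFloat(std::uint32_t word) {
    const std::uint32_t top = word >> 28;
    return top == 0x3u || top == 0xBu;
}
```

Note that the heuristic also accepts values slightly above 1.0, such as 1.5, and rejects 0.0, so it is best treated as a hint to switch the memory window to a floating point view.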

If you suspect it is a floating point number, you can then sometimes tell if it is an original (hard wired in the code) value, or a value arrived at by calculation. Consider the numbers above. The values 1.0, 2.0, 0.5 and 100.0 all have trailing zeros in their hex representation. The value 0.33333334 also has a repeating run of As in it.

By contrast, the less rational number 3.14159274 has what seems to be a random string of hex digits. We can see the degree of entropy in the hex number matches that in the floating point number.

So, a floating point number that has been the subject of some computations is much more likely to have random looking hex digits. Hence, you can tell whether you are looking for code more like this, from an update function:
 

p->m_speed = sqrtf(p->m_speed * p->m_speed - 2.0f * g * h);

or from an initialization function

p->m_speed = 2.5f;

 

Small Integers

 

Small integers (in the range 0 to 10000) are usually counters or enums. If you see the value incrementing or decrementing evenly, then that indicates a counter.

If you see it oscillate between a few fixed values, then it is probably some kind of state variable.

Does this small integer seem to match anything in the game at the time of corruption? Some possibilities:

Score
Health
Lives
Level number
Weapon number
Button Pressed

Try to find some correlation between what is going on the game, and the value of corruption.

Large Integers

 
As numbers get larger, the number of uses for them decreases. It’s unlikely that you will be managing groups of over 100,000 items. If you have a large integer that looks like it is counting, then you should consider what it might be counting.

Consider then if it might actually be a pointer, or a code address, and not an integer value at all.

Negative Integers

Example:

FFFFF3A2 or A2 F3 FF FF

Negative integers start with ‘F’s rather than ‘0’s.

Integers are generally used for counting things. If you have a negative integer, then that greatly narrows down the range of things it might be used for.

Some code uses the negative form of an integer as a single kind of flag to change the behavior of the code, avoiding the need to have an additional flag.

Negative numbers are also sometimes used as error codes. Some functions take a pointer as a parameter, and then return the error code in the location pointed to by the pointer. If the pointer is incorrect, that will lead to memory corruption with a negative number.

 

Magic Hex Numbers

 

Example

DEADBEEF or EF BE AD DE

 

A magic hex number in the context of debugging is a hex value that has been specifically chosen by the programmer to be visible in the debugger.

The numbers are also chosen so that using it inadvertently will maximize the chances of that use causing an error, and hence alerting the programmer to the illegal usage.

The most common use is in initializing a block of memory to certain values both when it is allocated and when it is freed. This both makes the block visible in the debugger (in the memory window), and also fills it with values that the programmer should notice if they are used either before the memory has been initialized correctly, or if the memory continues to be used after it has been freed.

 

Common Magic Hex Numbers are:

CCCCCCCC
CDCDCDCD
DEADBEEF
DEADDEAD
DDDDDDDD
FDFDFDFD

 

Use of magic numbers varies by platform. Often developers use their own magic numbers, and they tend to prefer those that can be read aloud, such as DEADBEEF.
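
As an illustration of the fill-on-allocate and fill-on-free idea, here is a small sketch. The function names are hypothetical, and the fill bytes 0xCD and 0xDD are simply one choice taken from the list above; a real allocator would call these from inside its allocate and free paths.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Fill patterns chosen for this sketch.
constexpr unsigned char kFreshFill = 0xCD;  // allocated, not yet initialized
constexpr unsigned char kFreedFill = 0xDD;  // freed, must not be used

// Call on a block just after the allocator hands it out.
void MarkFresh(void* block, std::size_t size) {
    assert(block != nullptr);
    std::memset(block, kFreshFill, size);
}

// Call on a block just before it is returned to the allocator.
void MarkFreed(void* block, std::size_t size) {
    assert(block != nullptr);
    std::memset(block, kFreedFill, size);
}

// True if a 32-bit word consists entirely of one of the fill bytes,
// which suggests uninitialized or already-freed memory is being used.
bool IsFillWord(std::uint32_t word) {
    return word == 0xCDCDCDCDu || word == 0xDDDDDDDDu;
}
```

A pointer or float that reads as CDCDCDCD or DDDDDDDD in the debugger then tells you which of the two mistakes you are looking at.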

 

Magic ASCII

 

Example:

474E5089 or 89 50 4E 47 or ‰PNG

 

Frequently asset files are identified by a four byte (partially) ASCII string that indicates the file type in some human readable way. It’s quite unlikely that this will find its way into a single word corruption, but it’s worth looking in the ASCII column in the memory window, just to check if this is the case, since if you recognize this, it should hopefully point you directly at the culprit.

 

Pointers

 

Example

00434150 or 50 41 43 00

 
Your program usually occupies a relatively small amount of the available four gigabyte address range of a 32-bit pointer. Hence, pointers usually fall within a recognizable range.

Under Win32, your executable starts at address 00400000 (4MB from the start of its virtual address space) so function pointers, and pointers to static data will often start with 004 (and 005, 006 etc as your program increases in size).

On the PS2, your executable starts at 00100000 (1MB), so pointers will start with 001, 002, etc.

Function pointers are an unlikely candidate for corruption data, so if you see a pointer like this, it’s more likely a pointer to some static data.

The most common type of pointer to static data that is passed around is a pointer to a string. If it looks like you have a pointer in your corruption data, then try following it and see if it points to a recognizable string.

Depending on your platform, pointers may be more likely to be word aligned. On the PS2, pointers to code or any word sized data must be word aligned. The PC allows all data referencing at the byte level.

 

Random Numbers

 

Example

9D29F113 or 13 F1 29 9D

 

When you look through the memory occupied by your game, you will find surprisingly little data that looks random. There are usually lots of zeros, and where the data is more closely packed, certain bytes or patterns predominate.

So when you find a number that looks random, it almost certainly has some meaning. Here are some of the things it could be.

 
A floating point number – as mentioned previously, a floating point number with several significant digits will look kind of random. The constant pi (3.141592654) comes out as 40490FDB – which looks random.

 
A checksum – if your code uses a checksum, such as CRC32, for some reason, such as identifying assets, then this could be a stray one. If you have the capability, then try seeing what string generates this checksum.

 
Compressed data – well compressed data should look random. It’s unlikely that it would end up in a single word of corruption, but possible.

 
Text – It looks random at first sight, but if the bytes are mostly in the range 0x30 to 0x7F, then it is quite possible that it is a fragment of a string. See what it says in the ASCII column of the memory window.
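
For the checksum case, a brute-force search over known asset names is often enough. The sketch below assumes the game uses the standard CRC-32 (the reflected 0xEDB88320 polynomial, as used by zlib and PNG); if your engine uses a different hash, substitute it for Crc32. FindNameForChecksum is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Bitwise CRC-32 using the reflected polynomial 0xEDB88320.
std::uint32_t Crc32(const std::string& text) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char c : text) {
        crc ^= c;
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; apply the polynomial only if the low bit was set.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Returns the first candidate whose checksum equals `word`,
// or an empty string if none of them match.
std::string FindNameForChecksum(std::uint32_t word,
                                const std::vector<std::string>& candidates) {
    for (const std::string& name : candidates) {
        if (Crc32(name) == word) {
            return name;
        }
    }
    return std::string();
}
```

Feeding it the list of asset names from your build pipeline turns a random-looking word into the name of the asset, and from there to the code that handles it.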

 

Block Corruption

 

Block corruption is where a group of words in memory are corrupted more or less together. The block can be any size, but we are generally talking anything from four bytes to 1024 bytes.

The corruption data in the block may contain any combination of the types of corruption data found in a single word, as discussed previously. There are a few situations specific to block corruption.

 

Partial corruption

 

When the data in the block of memory covered is not entirely corrupt, just say every few bytes or words has been changed, then this is a good indication that we are dealing with a pointer to a data structure (a structure or class) that has gone astray.

The most likely explanation is a dangling pointer. The code is continuing to update some data structure that has already been freed.

Full corruption

 

If the block of corruption is contiguous and no byte within it remains unchanged (except for a few common bytes, like zero, that might exist frequently in both corrupt and correct data), then it seems like the data structure has either been initialized, reset, or copied from somewhere else.

 

Unit Vectors

 

A common arrangement of three floats is in a vector, and a common sub-group of vectors is the unit vector. Unit vectors are quite recognizable in memory, since they consist of three small floating point numbers (in the range -1.0 to +1.0), and so they frequently start with the hex digit 3 or B.

Here’s an example of a unit vector sitting incongruously in the middle of a string.

 

5c6b6369 73636f64 6d61675c 6e697365  ick\docs\gamesin
3e6fdb1a bd0ee1b0 3f7909cd 6f635c6b  .Ûo>°á.½Í.y?k\co
655c6564 706d6178 5c73656c 6d617865  de\examples\exam

 

Looking at the hexadecimal, it is not immediately obvious that anything is wrong. We can see however from the text display column that there seems to be some garbage bytes in the middle of the path name.

Looking more closely at the garbage bytes, viewed as words, we can see that two of them start with 3, and one starts with B – a very good indication that we are dealing with a vector of small numbers, possibly a unit vector.

 

We can then switch to a floating point view, which gives us:

 

2.6502369e+017  1.8019267e+031  4.3599426e+027  1.8062378e+028
0.23423424      -0.034883201    0.97280580      7.0364824e+028
6.5049435e+022  2.9386312e+029  2.7403974e+017  4.3612297e+027

 

This confirms the nature of the corruption. We have three floating point numbers in the range -1.0 to +1.0. A quick calculation confirms that if we square the numbers and add them, the result comes out at about 1.0, so the length of the vector is 1.0: a unit vector.
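
The quick calculation can be automated along these lines. This is a sketch assuming 32-bit IEEE 754 floats and a tolerance of 0.01, which is an arbitrary choice; the function names are invented for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
              "this sketch assumes 32-bit IEEE 754 floats");

// Reinterpret a 32-bit word from the memory window as a float.
float WordToFloat(std::uint32_t word) {
    float f;
    std::memcpy(&f, &word, sizeof(f));
    return f;
}

// Length of the vector stored in three consecutive 32-bit words.
float LengthFromWords(const std::uint32_t words[3]) {
    const float x = WordToFloat(words[0]);
    const float y = WordToFloat(words[1]);
    const float z = WordToFloat(words[2]);
    return std::sqrt(x * x + y * y + z * z);
}

// True if the three words decode to a vector of length close to 1.0.
bool LooksLikeUnitVector(const std::uint32_t words[3], float tolerance = 0.01f) {
    const float length = LengthFromWords(words);
    return std::isfinite(length) && std::fabs(length - 1.0f) <= tolerance;
}
```

Run against the three garbage words from the dump above it reports a unit vector, while the neighbouring path-name words decode to huge values and are rejected.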

 

Causes and Effects of Corruption

Once you have determined the likely nature of the corruption, you need to identify the piece of code that caused the corruption. If you are not able to directly observe the corruption taking place, you may have to selectively instrument suspicious pieces of code.

To narrow down the field of pieces of code that might be considered, we should have a look at the most common direct causes of corruption, and examine how each cause manifests itself.

Buffer Overruns

A buffer overrun is perhaps the most common type of bug. You often hear about “buffer exploits” in the hacking world. Here a programmer has neglected to check that the size of the input data fits into the destination space. The data overruns the buffer, and possibly overwrites some space used for code. By adding some appropriate code to the end of the data, an industrious hacker can inject some of his own code into an application and take control of it.

Buffer exploits are less of a security problem for game developers, unless they are accepting data over the internet. However buffer overflows are still a very significant cause of bugs.
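
A minimal sketch of the missing size check, using a hypothetical NameRecord with a 16 byte text field, might look like this. The guard word after the buffer is an extra assumption for illustration: unchecked code elsewhere that overruns m_name would most likely trample it, which makes the overrun detectable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::uint32_t kGuardValue = 0xFDFDFDFDu;  // arbitrary magic number

// Hypothetical record with a fixed-size text field and a guard word after it.
struct NameRecord {
    char m_name[16];
    std::uint32_t m_guard = kGuardValue;
};

// Copies at most sizeof(m_name) - 1 characters and always terminates.
// Returns false if the input had to be truncated to fit.
bool SetName(NameRecord& record, const char* text) {
    assert(text != nullptr);
    const std::size_t capacity = sizeof(record.m_name) - 1;
    const std::size_t length = std::strlen(text);
    const std::size_t count = length < capacity ? length : capacity;
    std::memcpy(record.m_name, text, count);
    record.m_name[count] = '\0';
    return length <= capacity;
}

// False if something has written past the end of m_name.
bool GuardIntact(const NameRecord& record) {
    return record.m_guard == kGuardValue;
}
```

Checking GuardIntact at convenient points, such as once per frame, narrows down when an overrun by some other piece of code took place.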

 

 

Bad Pointers

 

If the value of a pointer is incorrect, then it can corrupt memory (as well as providing bad data to whoever uses that pointer). The value in a pointer can become “bad” in a number of ways.

 

Dangling Pointer – If a memory block is de-allocated or freed, yet some pointer still references that block (or an object within that block), then that pointer is said to be a “Dangling Pointer”. The value of the pointer has not changed, however the pointer has become bad since it no longer points to valid data.

 

Incorrect Pointer Calculation – The pointer could be generated using incorrect pointer arithmetic, or using other values that are themselves incorrect, causing the value of the pointer to be calculated incorrectly. Pointer arithmetic might also return a pointer out of range of the target buffer – a form of buffer overflow.

 

Corrupt Pointers – The memory in which the pointer is stored may itself have become corrupted due to some unrelated cause. Thus corruption can cause corruption, extending the chain of causes.

 

Bad Local Pointers

If a pointer is created to an object that has local scope, then that pointer will only be valid while that object is in scope. See Listing 1.

 

Listing 1

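// Assumed declarations, defined elsewhere:
//   bool ThingCheck(const CThing &thing);
//   void AddToList(CThing *p_thing);
void CheckThing(CThing *p_x)
{
  CThing p_thing;          // local copy, lives on the stack
  p_thing = *p_x;
  if (ThingCheck(p_thing))
  {
    AddToList(&p_thing);   // BUG: stores the address of a local in a global list
  }
}                          // p_thing goes out of scope here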

 

Here a local variable p_thing is being used for some temporary purpose. However, during the course of the function a pointer to the variable is added to some global list, then the function returns.

The result is that there is now a pointer in some list somewhere that points to memory that is used by the stack. This will not be an immediate problem, since when the function returns, then the stack pointer will recede higher in memory, leaving the instance of p_thing safely below the stack. Then one of two things might happen.

Object gets corrupted – the object pointed to by the list entry now no longer legally exists, however its binary image is still in memory, and code can continue to use it without problems until the stack once more descends below that location in memory. At that point the object may get corrupted. This, in a sense, is not a memory corruption bug, since the writes are legal, and in the correct place. But it behaves very like a corruption bug.

Stack gets corrupted – the object is in a list, and presumably some operations are going to be carried out with it. When the stack descends past this point in memory, then if that object is updated via the list, updating the object will corrupt some memory that is legally being used by the stack. This could be a return address, it could be a saved register value, or it could be local variables in some routine higher up the call stack. Whichever it is, the effects will be deferred until the function call stack returns to that point, which could be quite distant from the cause of the problems.

 

Stack overflow

The stack overflowing can cause memory corruption in a number of ways. A stack is of a fixed size, and as soon as it overflows that size it has begun to corrupt memory. What happens next depends on the size of the stack frame, the position of the stack in memory and what lies beneath it in memory.

Not all platforms are equally vulnerable to corruption from stack overflow. Win32 will simply raise a stack overflow exception if the stack pointer writes beyond the bounds of the stack. On platforms that do this, debugging is a relatively simple matter of looking at the call stack, which should have one or two functions repeated over and over, pointing you directly at the culprit.

Other platforms are less fortunate. The PS2 has no special protection for the stack pointer. The stack is frequently placed at the top of the 32MB of memory, which means that if it grows downward past the area reserved for it, it can corrupt data and possibly even code.

Code Corruption

 

As already mentioned, if the stack is allowed to proceed apace through memory, it will eventually overwrite the code that is currently being executed, causing a crash.

Code corruption due to stack overflow can often be seen in the disassembly view of the debugger. If the code before the crash location looks reasonable, and the code after looks repetitive or contains illegal instructions, then overflowing stack corruption is the most likely cause.

 

Sparse Corruption

 

The common cause of stack overflow is runaway recursion. A less common cause is moderate recursion combined with a very large stack frame. This occurs when the programmer has a local variable in the recursive routine that takes up a large amount of space. See Listing 2.

 

Listing 2

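class CBuffer
{
public:
  int x[2048];   // 8K, assuming 4-byte ints
};

// Dig, NotFinished and Finish are assumed to be defined elsewhere.
void DigTree(CTree *p_tree)
{
  CBuffer local_buffer;   // a new 8K instance on the stack per recursion level
  Dig(p_tree, local_buffer);
  if (NotFinished(p_tree))
    DigTree(p_tree);
  Finish(p_tree, local_buffer);
}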

 

Here DigTree is a recursive function that has a local variable local_buffer. A new instance of local_buffer must be created on the stack for every level of recursion. Since the CBuffer class takes 8K of memory, it takes relatively few recursions to overflow the stack, especially on consoles such as the GameCube where the stack size is kept as low as possible, often down to 64K or less.

If the huge CBuffer object is not cleared every time the function is entered, then this can have the effect of corruption being evenly spaced every 8K through memory. This can be quite a red herring in a number of ways. Firstly, the first time you see the corruption it might seem to be just a single instance of corruption, making you not think of a stack overflow.

Secondly, it can overstep any tests you do to check for stack overflow. Often you would place some magic numbers in the bottom of the stack, and check to see if they are still there, as a way of detecting the stack has overflowed. If the game does not immediately crash during the recursion, the stack pointer will return to a normal address, and your code will run along merrily until the corruption causes some later problem.

Thirdly, if it is runaway recursion, then the large stack frame might overstep the code that is being executed, causing widespread code corruption, yet not actually crashing in the code that caused the corruption. While this should still point you to stack overflow, the culprit will be less obvious. A stack analysis on the corrupting stack frame will indicate the location of the corrupting code – providing you can detect the write that causes the corruption.
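
One way to make the magic number test harder to overstep is to paint the whole unused stack region, not just a few words at the bottom, and then check every painted word. The sketch below works on an ordinary array standing in for the stack area, since finding the real stack bounds is platform specific; the names and the DEADBEEF paint value are assumptions for illustration, and the stack is assumed to grow downward toward the start of the region.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t kStackPaint = 0xDEADBEEFu;

// Fill a region with the magic word. On a real target this would be the
// unused part of the stack, painted once at startup.
void PaintRegion(std::uint32_t* lowest, std::size_t wordCount) {
    assert(lowest != nullptr);
    for (std::size_t i = 0; i < wordCount; ++i) {
        lowest[i] = kStackPaint;
    }
}

// Number of words at the low end that still hold the paint value.
// This is the margin the stack has never reached.
std::size_t UntouchedWords(const std::uint32_t* lowest, std::size_t wordCount) {
    std::size_t n = 0;
    while (n < wordCount && lowest[n] == kStackPaint) {
        ++n;
    }
    return n;
}

// Counts every painted word that has changed, so sparse writes from a
// large stack frame are noticed even if they skipped the lowest words.
std::size_t DisturbedWords(const std::uint32_t* lowest, std::size_t wordCount) {
    std::size_t disturbed = 0;
    for (std::size_t i = 0; i < wordCount; ++i) {
        if (lowest[i] != kStackPaint) {
            ++disturbed;
        }
    }
    return disturbed;
}
```

Extending the painted region below the official bottom of the stack, where the platform allows it, gives the same check a chance of catching sparse writes that landed beyond the stack's reserved area.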
