Cowboy Programming

Game Development and General Hacking by the Old West

December 29th, 2008

Some Uses of SQL Databases in Game Development

RELATIONAL DATABASES

Relational Databases are sometimes viewed as being in the domain of business applications and web development. You would use a relational database for boring applications such as inventory, accounting, or implementing a shopping cart system for a commercial web site. Databases can be viewed by game-programmers as old-fashioned, large, slow, and replete with imponderable terminology such as “inner joins”, “foreign key” and “tuple”. However, modern databases are actually fast, easy to use, and can be quite useful in various stages of game development. While relational databases can be immensely complex and powerful, their robustness and ubiquity means they can also be useful for relatively simple tasks. In this article I’ll attempt to de-mystify databases a little, show how they can easily be incorporated into development code, and discuss a few potential usages.

WHAT IS IT?

A relational database is simply a set of tables with named columns where each row is an individual record. Game programmers essentially use many forms of relation databases at run-time to store and organize things like game objects and various resources. Those databases are generally rather ad-hoc, and tuned for a specific purpose and operating environment, and are implemented using custom code. While technically this is a database, the programmer would probably not refer to it as such. The usages of relational databases I discuss below are not intended to replace these custom in-game “databases”, but rather to add new functionality to be used during the development process.

While the examples I give are all for Windows based development, the nature of communication with a database server is basically text based, so a minimum amount of work would be required to implement similar functionality on console platforms.

SETUP

Databases are run by a server, so the first thing you need to do is set up a server. The server can be local, in that it’s part of your code, and you access it directly via a relatively low level API. Alternatively the server can be remote, and you access it via a network connection. This distinction can be blurred a little as you can have a server on your local machine, accessed via the network (using localhost) – but that’s still essentially a remote server, just somewhat quicker. Here we’ll be discussing remote servers.

There are several ways of setting up a database, and what you settle on will vary with your needs and situation. If you have developers in various locations, then you might benefit from having your database hosted by a third party, as this should ensure everyone has sufficiently fast access. If your developers are all on the same network, then you’d more typically have the databases hosted on the network server. If you are a lone developer, then you’d be more likely to have the database on your local machine, to take advantage of the additional speed.

Setting up a database is very easy. If you have remote hosting you are often supplied with a web interface such as phpMyAdmin that allows you to create databases and users. On a local network your network server will often already have some database server software installed, and you can just add a database to that. Lacking this, you can very simply install a database by downloading and installing the MySQL software, which takes only a few minutes to get up and running.

Once you have a database up and running, it’s very important that you have some way of testing your connection and the database, so you can more easily debug problems with your code. A useful tool here is HeidiSQL, a free program that let’s you connect to your database server, and setup, examine and modify databases in a visual manner. There is also more fully featured software such as PremiumSoft’s NaviCat, which performs similar functions. In the examples below I give the SQL query definitions for the database tables. While it’s quite possible to set these up using a command line tool or web interface, it’s generally easier to use a tool like NaviCat, as it allows you to more easily adjust individual parameters in your tables.

A database server can have various users. If you are just doing some initial experimentation, then you can just log in as the “root” user that you set up when you installed the server. However, as you expand the usage of the databases, then you will want to add additional users with fewer privileges to prevent inadvertent modifications to the database.

Once this is set up, you can now connect to your database from your code. The simplest way (from game code) is to use the C API. The code to connect is shown in listing 1. The SERVER_NAME would be the URL or ip address of your server, or “localhost” if it’s on your local machine. This setup needs only be done once when your program runs, and the other examples assume this has already been done, and there is a valid value in the “handle” variable. Error checking is omitted for clarity, but is something you will need to add, especially if connecting over the internet.

LISTING 1 – Code to connect to a database

#define SERVER_NAME “localhost”
#define DB_USER “user_name”
#define DB_USERPASS “password”
#define DB_NAME “db1″
MYSQL *handle=NULL;
handle = mysql_init(NULL);
mysql_real_connect(handle,SERVER_NAME,
DB_USER,DB_USERPASS,DB_NAME,0,NULL,0);

UNIQUE ASSET IDS

A useful example of using a database during development is for the generation of asset IDs. I touched on this briefly in my article “Practical Hash IDs” (Game Developer, December 2005). The idea is that for the purpose of efficiency (both speed and space) it’s best to refer to assets using a unique 32 bit ID. In the previous article, I suggested using 32 bit CRCs for the ID. That approach has a number of advantages, but there is still the problem with collisions, and if you are going to use databases in a broader manner it makes more sense to generate the IDs using the database.

The simplest way to do this is to have a table that consists of an ID and an asset name (as a string). The ID field will be an integer, and set to autoincrement. (See Listing 2) Then whenever you add an asset name to the table, a unique ID will automatically be generated. To find the 32-bit ID number of any string, we simply look to see if it’s in the database, and if not, we add it. Then we just query the database for this string. See Listing 3 for a function that implements this.

Listing 2 – SQL that defines the simple table of IDs

CREATE TABLE `table1` (
`id` int(11) NOT NULL auto_increment,
`name` text,
PRIMARY KEY (`id`)
)

Listing 3 – Get a unique ID for a string identifier

uint32 GetID(const char *name)
{
char select_query[1024];
char add_query[1024];
MYSQL_RES *result=NULL;
MYSQL_ROW row;
uint32 id = -1;
sprintf(select_query,“SELECT * FROM table1 WHERE name=’%s’”,name);
if (!mysql_query(handle,select_query)) {
result = mysql_use_result(handle);
row = mysql_fetch_row(result);
if (!row) {
sprintf(add_query,“INSERT INTO `table1` (`id`,`name`) VALUES (NULL,’%s’)”,name);
mysql_query(handle,add_query);
mysql_query(handle,select_query);
result = mysql_use_result(handle);
row = mysql_fetch_row(result);
}
id = atoi(row[0]);
mysql_free_result(result);
}
return id;
}

These IDs are typically baked into the data as part of the build process, but can also be used directly in the code in the exact same way as was outlined in “Practical Hash IDs”. Note here that what we are doing is not a run-time process. The database is only intended to be used during game development, for the initial creations of ID by the team. If you are using CRCs, then it’s quite easy to modify this code to check for collisions.

TRACKING ASSERTS

A common issue during game development is what to do about asserts and warnings in the code. Warnings are often ignored by non-technical staff, and manual solutions such as “when you see this warning, come and tell me” are not very reliable. The line between asserts and warnings is often blurred in order to facilitate uninterrupted development. Asserts (which should indicate some fatal error which requires immediate attention) sometime have an “ignore” option, which gets switched on by the creative staff, who don’t care so much about tracking down your bugs. In a development environment with a large number of people, there could be a lot of asserts or warnings being fired off by your code, and it swiftly becomes very difficult to separate signal from noise, and to get the information to the correct person. Clearly some automated system would be useful.

Here we can easily use a simple database table to track these things. Listing 4 and 5 show a simple implementation of this.

Listing 4 – SQL Table for storing asserts

CREATE TABLE `asserts` (
`assert` text,
`message` text,
`file` text,
`line` int(11) default NULL,
`machine` text,
`time` timestamp NULL default CURRENT_TIMESTAMP
)

<<LISTING END>>

Listing 5 – Assert replacements that log asserts to a database table

char assert_buffer[1024];

void assert_printf( const char* text, … )

{

va_list a;

va_start(a, text );

vsprintf( assert_buffer,text,a);

va_end(a);

}

void SQLAssert(const char *assert, const char *file, int line)

{

printf (“%s, %s, %d\n”,assert_buffer, file, line);

char query[2048];

sprintf(query,”INSERT INTO `asserts` (`assert`,`message`,`file`,`line`,`machine`) VALUES (‘%s’,'%s’,'%s’,'%d’,'%s’)”,

assert,assert_buffer,file,line,GetMachineName());

mysql_query(handle,query);

if (0 /*don’t ignore*/ ) __asm int 3

}

#define NewAssert( test)\

if( !(test)) { \

assert_buffer[0]=0 ; \

SQLAssert(#test,__FILE__,__LINE__); \

}

#define NewAssertM( test, params )\

if( !(test)) { \

assert_printf params ; \

SQLAssert(#test,__FILE__,__LINE__); \

}

// Usage, note extra parentheses:

// NewAssertM(p==NULL,(“p not NULL (%p)”,p));

// NewAssert(p==NULL);

<<LISTING END>>

The macro NewAssert is a drop-in replacement for the standard assert() macro. The only parameter is a test that must return true. If it returns false, then the macro calls SQLAssert with #test (a string containing the actual test code), and the standard file and line numbers. SQLAssert then formats a string that will add these to the database.

In addition there is a more sophisticated assert macro, NewAssertM, that takes an additional parameter which is actually a list of parameters enclosed in parentheses. These are passed to the assert_printf() function which treats it as a sprintf into the assert_buffer. The assert_buffer is then passed to the database. This allows you to add an arbitrary string to the assert info in the database, usually this would contain the values of various variables involved in whatever you are testing. See the example usage at the end of listing 5.

So what we have now is an assert macro (or a warning macro) that you can track every single instance where it fires, and no longer have to rely on the artists and level designers (or even the testers) to accurately report what is going on. You can even leave it in for beta versions, and gather a large amount of data from a geographically diverse set.

The example shown includes a field for “machine”, which is intended to hold the machine name. Using this you can identify if a particular warning is going off a lot for one particular user. You could quite easily extend the assert logging to hold any additional information that might be useful, such as the IP address, or the current level name. Since this is a standard database, it’s very easy to query, extract reports into spreadsheets, and even generate graphs and web-pages from the information in the database. Interesting metrics can be generated, such as which source files trigger the most assertions, or even what day of the week has the highest rate of problems.

The 'time’ field in the table is set to “default CURRENT_TIMESTAMP”, which means that whenever the assert fires, this field is set to the server’s current date and time. This can be very useful in tracking down bugs, as you can see when an error or warning first occurred, and attempt to correlate that with whatever was changed around that time. This can be useful for prioritizing things. If a particular assert has been triggering for several days (or weeks) and nothing is being done about it, then it might be that it needs to be downgraded to a warning, or an informational message (or you might need to fire someone.) This kind of high level overview of issues can be useful when there are a large number of developers on a team.

OTHER USES

The two examples above make very little usage of the vast power of a relational database server. The data structures are essentially flat, and there is nothing “relational” about them. Normally a database would have multiple tables, cross indexed with each other to avoid data duplication. However, there is nothing wrong with using a database in this simple manner. It might seem like overkill to use it for logging – which you might do to a CSV text file – but it does not cost us anything to do it like this, and you immediately get the benefits of multiple remote connections to a robust repository, and very sophisticated filtering and report generation. The fact that we are barely using any of the features of the database is beside the point.

With that in mind, there might be other obvious areas where a database could be used instead of plain logging. User input could be logged if your database is fast enough. Many gameplay metrics could be logged in early test versions, such as how long it takes to complete each goal in each level. This data could be collated over thousands of runs, and used to fine-tune the difficulty level of a game. Remember that once you have the basic database set up and the connection software in place, it is very easy to add arbitrary new tables, and to start recording whatever you like.

Resources:

MySQL – A powerful open source database server. dev.mysql.com

Paul DuBois – Writing MySQL Programs Using C, MySQL, Third Edition, Sample Chapter: http://www.kitebird.com/mysql-book/ch06-3ed.pdf

December 3rd, 2008

Custom responsiveness measuring device

I’ve had some good feedback from my Measuring Responsiveness in Video Game article, which explains how to use a cheap camera to measure the lag between presing a button and the result appearing on screen.   One of the problems with the method I was using is that it does not let you measure responsiveness in typical gameplay situations, since you have to set up the controller so you can see the button being pressed, which means your finger or thumb has to start moving while not touching the button – generally not the case in normal gameplay.

The guys at Infinity Ward saw this problem, and their solution was to commission modder Benjamin Heckendorn to build a custom joystick that had a seperate display with an individual LED for each button.  Here’s the result:

With this setup, the board can just be placed next to the monitor, and both videoed together in normal gameplay.   Specific moves and events can then be isolated, and the frames counted in the video to measure the responsiveness. 

Frame-by-frame analysis of the button presses might also be useful for some analysis along the line of what I suggest in Pushing Buttons, although that’s still better done in code.

[UPDATE]  Ben now has a more detailed post on the controller on his site, with videos of it in action:

http://benheck.com/12-05-2008/controller-monitor-built-for-infinity-ward-game-testing

November 15th, 2008

My coding practices in 1991

I wrote this in 1991, when I was writing Amiga and Atari ST games for Ocean Software in Manchester, UK. I think at the time I was working on Parasol Stars. It’s an interesting look at a simpler time in games programming.

An explanation of my 68000 development system - by Mick West
------------------------------------------------------------

The purpose of this document is twofold:

	1 - To make the meaning of my code a touch more accesible
            to those that may come after me.

	2 - To remind me of the meaning of my code.


Fundamentals
------------

	A 68000 development system, at present only encompassing
the Atari ST and the Commodore Amiga. My present System is PC
based, using the SNASM assembler, though this could change.

	I try to be modular and to this end I have lots of small
INCLUDE files that contain short modules of code to do things
like random numbers, heap management, sprite printing, etc.
	These modules are (hopefully) written to use the minimum
of extrenal references and ideally to incorporate the functions
of an Abstract Data Type - ie a 'Black Box' of code that can
only be accessed by calling a number of specified routines. The
ADT module can do whatever function it is required to do in
whatever way it wants so long as it fullfills the specifications.
	A good example is the scroll routines I wrote at the
start of the 'Darkman' project. I actually wrote three diffent
scroll routines and they all did the same thing, but in
different ways, some quicker, some using less memory. All three
had the same external interface with the same named routines
(SETUP_MAP_BUFFER, MOVE_LEFT_4, etc) to facilitated the screen
scrolling. I could simply include one instead of the other and
the program would assemble exactly the same.

	There are a set of fundamental routines that practically
all games require, You need to read the keyboard, you need a
Vertical Blank Interrupt, you need a double buffered screen, you
need some way of reserving areas of memory. These basic routines
are provide by the file KERNAL.S.
	For all my projects, the first part of the source code
will INCLUDE \DEV\KERNAL.S. I should point out at this point
that all my modules that are not entirely specific to a
particular project are kept in a folder called \dev, which
should also include this file you are reading now.
	
	Let's have a look at the very simplest program you can
write using this method:

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; SIMPLE.S - A very simple program

	org	$1000
	regs	pc=$1000,sr=2700
	include	c:\dev\kernal.s	
main
	bra	main
my_vbl
	trap	#0
	rts   

; data areas
		rsset	free_memory
something	rs.b	10000

; end of SIMPLE.S
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

	Ignoring the comments and blank lines, the first two
lines are to tell SNASM where to assemble the program to and to
turn the interrupts off before running.
	The next line includes the kernal routines, as it is at
the start of the program then the first thing that will be
executed will be a small routine at the start of KERNAL.S that
kills the interrupt vectors, sets up the double buffered screen,
installs a keyboard interrupt and sets up a VBL. 
	The kernal needs two external labels: MAIN and MY_VBL.
The first, MAIN (memories of C) is just the start of your
program. The second, MY_VBL is a routine that is called once per
Vertical Blank Interrupt.
	In the example above, all MAIN does is go into a loop.
All MY_VBL does is a TRAP #0 to give SNASM access to the
machine. This is not done in the kernal because you might want
some control over when you let SNASM get it's claws in, maybe
one per game cycle or whatever.

	The last couple of lines are rather fundamental in that
they illustrate how I do my memory organisation. The Kernal
allows a fixed amount (configurable, the default is 100K) for
your raw code (ie - the assembled code, including all modules).
All memory above that is assumed to be free and is reserved
using the RS directive (which I use extensivly, so you have to
understand it to understand me).
	The kernal will reserve about 70k itself for the two
screens and the stack. Then the lable FREE_MEMORY will be SET to
the end of this reserved memory. Any modules that you include
may also reserve memory, if so they will set FREE_MEMORY to the
end of the area they have just reserved. So in your main
program, if you want to reserve areas of memory then you will
simply use 'RSSET free_memory' to set the __RS variable
correctly and the just use RS to Reserve Space as desired.
	If you are writing a module that needs memory then in
addition to the above you will also need to SET the FREE_MEMORY
variable so other modules and your main code recognises what you
have done.
	This may sound complicated (it does to me!) but if you
just look back at the example program you can see that in
practice it is really very simple.

	More about the Kernal - since this is common to every
program I write, I shall describe it's functions here. Modules
should (if I have the time/inclination/energy) have a full
specification at the start of the source. So if you want to know
what it does, read the comments not the code!

Kernal Configuartion variables (compile time)
if these are undefined then they will be given a default
	ST		true if target is an ST (true) 
	THE_RPF		rasters per frame (1)
	MAX_CODE	max code size (100*1024)
	RASTER_COUNT	if true then will display a raster count (false)

Kernal Variables for external usage.
	TIME.L		starts as zero, increments every VBL
	SCREEN.L	address of the virtual screen
	REAL.L		address of the real screen

Major kernal routines
	FLIP_SCREENS	flips screen and real (waits for RPF)
	
Major macros
	PUSHEM		save registers (always long)
	POPEM
	PUSH.s		save a register (sized)
	POP.s
	PAUSE jiffies	hangs for a certain time
	CLS screen	fills 32000 bytes with 0

This is not everything in the kernal, you should look at the
source comments for that.


Mick's Code Style Conventions
-----------------------------
	
	I am trying to develop a consistent style of writing
code, this will cut down on having to write unique routines
every time I write a new program. The style should not just be
about code, but about data structures and how the code accesses
them. Eventually I hope to have macros for the most common code
structures as soon as I have them fully formalised.

Mick's Maxims (not in order!!) (!)
-------------

1 - Use Meaningful Labels!
2 - Use Local Labels!
3 - Save Registers! 
4 - Don't repeat code, put it in a routine or a macro!
5 - Don't use numbers, use equates!
6 - Use lower case!
7 - Use data structures!
8 - Use consistent program structures!
9 - Use Comments! 
10- Make Backups! 



1 - Use Meaningful Labels! By meaningful I mean use full words,
not abbriviations, use underscores to separate words, use the
proper tense for the word and the correct plurality. EG:
	setup_heap
	flip_screens
	move_baddies
	game_over
	file_not_found
	
2 - Use local labels! Only use global labels for routines that
can be accessed externally. ALL labels within a routine
(including any local variables it uses) MUST be local. EG:

main_routine
	clr.w	d0
.loop
	add.w	#1,.counter
	bsr	another_routine
	bne	.loop
	rts
.counter	dc.w	0

next_routine
	...
	
3 - Save Registers! Unless you formally specify otherwise, all
registers not used to return values should be UNCHANGED when the
routine returns to caller. If you want to write faster versions
then do so, but SPECIFY that they may corrupt registers.

4 - Don't repeat code! Put it in a routine or a macro!

5 - Don't use numbers! Use equates! EG:

	add.w	#4,man_x		; is WRONG
	add.w	#walking_speed,man_x	; is RIGHT

6 - Use lower case! Upper case programs are a useless reminder
of the ancient days of computing. They make programs difficult
to read, and your comments either have to be in upper case, or
you have to mess around with the CAPS LOCK key. It's just a
waste of time.

7 - Use data structures! This is a big subject, addressed later.

8 - Use consistent program structures! Like mine for example.

9 - Use Comments! At the very minimum each routine should have
one comment, on  the line before the label. This should give a
brief description of what the routine does and it's input and
output requirements. A full specification of a routine should
include the following.
	Exact input/output requirements
	Register corruption
	The external references it needs
	Brief explanation (one line is ok) of the algorithm
	Possible areas for improvement
	Known bugs still to be fixed
As well as the comments at the start of a routine it is usually
a good idea to have comments in the routine itself. For
espescially complex routines it is a great help to development
to comment EVERY line to provide information on: Flow of
control, meaning of decisions, contents of registers and just
general elucidation of exactly what you think you are doing.

10 - Make Backups! Not strictly a coding requirement, but still
a vital part of development. I use an Archive program and have a
batch file to archive everything in the \DEV folder and
everything in the folder containing the project that I am
currently working on. They are archived to hard disk and then
copied to floppy, which is then taken home and copied to my hard
disk there.
	Besides the obvious frustration of having to rewrite
code that you have already slaved hours over, there is the fact
that there is a lot of money at stake here. Think money, make backup!

	Right, that's enough maxims for one day (ten was good
enough for Moses so it'll do for me) Now, let's expand on the
dual topics of Data Structures and Program structures. 
	At university they taught me somthing called 'Data
Driven Design', the idea was that you built your program around
the structure of what you were doing things too, the data.
	They also waffled a lot about Abstract Data Types which
I described briefly earlier and involve keeping the data
structure seperate from the program code.
  	Data driven design is really a very simple concept. Put
at it's most basic level all you do is list everything in the
program and then write routines to handle each of them.

	This document is not supposed to be a beginners guide to
structured programming, so I'll just describe what I do, and
what I think I might do if I get around to it.

	What came first, the program structure or the data
structure? I usually start with a loose outline of the program
structure and then write some specific data structures and then
the program and the data structures evolve together as they
respond to changing needs and new idea. This is not the way it
should be, but it has happend like that because of the usual
lack of time and energy. 

	Anyway, enough waffling, what is a data structure? Well,
it's like a RECORD in pascal or a STRUCT in C, which probably
leaves you none the wiser. Essentially it is an array or a list
of groups of bytes. It is best illustrated by an example.
	Say you are writing galaxians or something similar,
obviously you need to store the positions of all the aliens. To
do this you will just have a list of say 20 bytes for each
alien, in those 20 bytes you will store information such as the
x and y position, the speed, the sprite number, the energy, what
it will score, how long before it next does something and things
like that, depending on your game.
	To get information on a specifc alien you point an
address register to the base of the alien and then use offsets
to access the various bits fo data in that alien. EG:

	lea	alien,a0		; address of alien
	move.w	(a0),d0			; get x
	move.w	2(a0),d1		; get y
	move.w	4(a0),d2		; get sprite number
	bsr	draw_alien		; draw it

	In line with my maxim "don't use numbers", I dont use
numbers as the offsets. Instead I define my structures using the
RS command. This let's you set up a data structure as if you
were defining variables using DS, but instead of the labels
being absolute addresses they will be offsets relative to the
start of the code. For example:

		rsreset			; new data structure
alien_flag	rs.w	1
alien_x		rs.w	1
alien_y		rs.w	1
alien_xv	rs.w	1
alien_yv	rs.w	1
alien_sprite	rs.w	1
alien_energy	rs.w	1
alien_len	rs.w	0

	Having set up a data structure I will define an area of
memory at the end of the program (using RS and explained a page
or so ago) like:

alien_table	rs.w	max_alien*alien_len+2

	The table will be initialised by:

		fill	#alien_table,#max_alien*alien_len,#0
		move.w	#-2,alien_table+max_alien*alien_len

	So the extra word at the end of the table will be in
effect a dummy entry in the list with a flag value of -2. I
usually end my lists with a negative value in the first field.

	There will be a routine called NEW_ALIEN that will
return a0 as the first free baddy in the list. My usual way of
going through all the objects in a list is as follows:

	lea	alien_table-alien_len,a0
.next_alien
	lea	alien_len(a0),a0
	tst.w	alien_flag(a0)
	beq	.next_alien
	bmi	.done

; just an example of the sort of thing I might do with it.
	move.w	alien_x(A0),d0
	move.w	alien_y(a0),d1
	move.w	alien_sprite(a0),d2
	bsr	draw_sprite
	
	bra	.next_alien
.done

	
	You can kill a object using

		clr.w	alien_flag(A0)

	This sort of structure is the fundamental basis of my
programs at the moment.




	
-------------------------------------------------------------
	I have written this in a brief attempt to slightly
formalise what I am doing, mostly to get a few things clear in
my own mind. So if you are not me and this does not seem
entirely sensible to you then hard luck...
September 9th, 2008

Debugging Memory Corruption in Game Development

Definition:  Memory corruption is an unexpected change in the contents of a memory location.

The symptoms of memory corruption can range from hard crashes, all the way through minor glitches, to no symptoms at all. The causes of memory corruption are many and varied, and include memory corruption itself.   In this article I attempt to classify the various ways in which memory corruption can manifest itself, the various causes, and some ideas for identifying the root causes of various types of memory corruption.  I’ll cover:

  • Symptoms of Memory Corruption
  • Investigating Corruption
  • Identifying Hex Droppings
  • Causes and Effects of Corruptio

 

Symptoms of Memory Corruption

Given that memory corruption can manifest in almost any way, it seems redundant to list all the symptoms. However, different symptoms of memory corruption are inticative of different causes of corruption. Sometimes we can also gather valuable clues from the type of symptom, which might lead us closer to the cause of the corruption.

 

Crashes

Crash bugs come in all flavors, but memory corruption can cause just about all of them. The way in which the game crashes can give you valuable clues as to what type of memory corruption is occurring. These clues can indicate where you need to start looking for the cause of the crash.

Address Error

An address error indicates that a pointer has been modified to point to an illegal address. This could be an address that is: not word aligned, NULL, or an address that points to unmapped or protected memory.

Address errors are quite helpful, since program execution stops when an address error is encountered, and it is quite easy to enter the debugger, and determine the address and corrupted contents of the pointer variable that is being used.

Infinite Loop

Corruption of data can make a loop fail to terminate. Take for example code that traverses a linked list. Memory may be corrupted in such a way that the list contains a loop. Since the code expects the list to terminate with a NULL value at some point, it simply carries on around the loop forever.

This behavior is a lot more likely with a list that uses indexing instead of pointers, but it is still possible with pointer. Consider  the implication of memory being corrupted in such a way that a list gets a pointer corrupted so that the list is now circular. It is very unlikely that some random corruption, or some unrelated code would happen to stick a semi-valid pointer in the right place. Hence, it is more likely that the corruption was something in the list code itself.

 

Illegal Instruction

An illegal instruction could mean one of several kinds of memory corruption.

 

Stack Corruption – If the stack has been corrupted in some way, this can lead to an incorrect return address, which ends up pointing to illegal code. This is the most common way buffer overruns are exploited by hackers.

 

Jump Table Corruption – if a v-table (or any kind of table of jump-addresses) is corrupted, then the PC can end up pointing at an illegal instruction.

 

Code Corruption – The code itself can be corrupted by a bad pointer corrupting sections of the code. This type of corruption can be very hard to detect if the code that has been corrupted is not executed very often.

 

Stack Overwriting Code – A special kind of code corruption. Runaway recursion can sometimes run unchecked until the stack overwrites the routine that it is executing. This shows up nicely in the debugger in a hex window.

 

Function Pointers – Since function pointers are sometimes stored in data structures, and passed around like regular variables, then they can be corrupted just like any other variable. This can eventually lead to the program executing incorrect code.

Unexpected Values

If you have a variable of some kind that normal has a value in a certain range, and you unexpectedly find that the variable contains some ridiculous value, then this may be due to memory corruption.

Wildly unusual values often have noticeable effects, such as the player teleporting to the end of the universe, or a model being scaled infinitely large.

Less severe corruptions can occur, for example, a counter might simply be reset to zero or even just changed slightly. This type of corruption can be difficult to track down, as it may not produce especially noticeable effects.

Here a good testing department is invaluable. If the testers can notice little inconsistencies like this, then you will catch potentially harmful bugs at a much earlier stage.

Since the location of the corruption of memory is often somewhat random, then the problem may go undetected for some time. This may give the false impression that the existing code is solid. Upon adding new code or data, the bug may reveal itself, causing you to think that the new code has caused the bug, when in fact the new code has only cause memory to be slightly re-ordered into a configuration that reveals a pre-existing bug.

 

Glitches in the Graphics

Since memory often contains graphical data, then if memory is being corrupted, it may show us as some corruption in graphics. The way this is manifest will depend on the nature of the graphics, and the nature of the corruptions.

Textures

Changes in color of a single pixel, or a very short row or column of pixels, indicates that a pointer to a variable has acquired a wrong value, perhaps the result of earlier memory corruption.

Changes to large swathes of a texture indicate either an incorrect pointer, or some kind of buffer overrun. Corruption that looks like a regular patter, often containing vertical or diagonal stripes indicates some kind of array exceeding its bounds, or one that is now at an incorrect address due to a corrupt pointer.

Corruption in a texture that resembles a squashed or discolored version of another texture indicates that you might be overwriting the texture with another one of different dimensions or different bit depth.

If the corruption is static (unchanging), then it indicates a one-time event, where a pointer was misused just once. The corruption happened, and the game went along on its way. In this case, you need to try to track down what triggered that event. Testers need to try to find a way of duplicating the circumstances that lead to the visual corruption. Video of the game is very useful in this case.

If the corruption appears to be animating, if the corrupt section is flickering, or the banded area is flashing on and off, then you have some ongoing corruption. If the game remains in this state, it should make it easier to debug.

 

Meshes

Corrupt meshes usually result in some vertices being displaced a considerable distance from the model. If the corruption region is small (a word or so), then you may just see one vertex displaced, this will appear as a thin triangle or line that extends off screen and swings wildly about as the model animates.

Corruption of a large amount of the model’s mesh can result in the model “exploding”, covering the entire screen with random looking triangles that flicker and swing around.

 

Skeletons and Animation

Corruptions of the underlying skeleton data, or associated animation, can result in the model still looking somewhat recognizable, but with the various body parts being displaced to unusual locations. Corrupt animation will result in body parts flickering and jumping around wildly. The exact manifestation of the symptoms of corruption depends upon the method used to store the animations.

 

Investigating Corruption:

If you suspect that memory corruption is occurring, then your first step is to try to determine if this is actually some form of corruption, and what type it is.

 

Is it actually corruption?

 

Just because a value in memory looks rather unusual, does not mean that it was not generated by the code that owns that memory. The unusual value might simply be the result of an error in your logic. It could have been quite legally copied from somewhere else. It could be the result of computations involving incorrect data, perhaps data that was already corrupt.

To determine this, you need to determine if the code that you might think is writing to that location actually is writing to that location, and see what values it is writing. Ideally, you would add assertions at all location that you think might legally be writing to that location, and check the range of values that are being written (make sure the “corrupt” value is outside this range.)

 

Who owns that memory location?

Memory corruption usually occurs when some piece of code is using an area of memory that it should not. The corrupt memory then causes problems in some code

There are two primary ways in which this can happen: corrupting a legal area, and using an area illegally.

Consider a piece of code A, that uses and area of memory A(m). If another piece of code, B, also happens to have a pointer to A(m), and writes some data to that, then code B is corrupting memory A(m). This is the normal form of memory corruption.

Now consider if the code A is legally using memory location A(m). Code B is illegally also using some location within (or overlapping) A(m). Code B appears to work correctly, but then code A makes a legal update to A(m), causing code B to manifest a bug. It appears that code A is corrupting memory B(m). However, the fault here is with code B. It has the appearance of corruption, yet may mislead you to thinking that the problem is with code A.

It is important to determine who actually owns the memory location that is being corrupted. Is the “legal” use actually legal? Can you demonstrate that code B actually owns those memory locations? If you can quickly determine that code B does not actually own that memory, then the tracking down of code A is irrelevant, which can save you substantial time.

 

Repeatable, Fixed Location

If the corruption is consistent, meaning it happens in the same location and under the same conditions then you are (relatively speaking) in luck. Debugging in this case is a matter of somehow watching that location, and tracking down the cause of the corruption. Since the corruption happens under the same conditions, you should be able either to trap it immediately, or quickly narrow down the possibilities.

Intermittent, Fixed Location

If the corruption happens in the same memory location, yet is intermittent, then this makes tracking down the corruption more difficult. Since you do not know when the corruption occurs, you cannot be as focused in your search, and must rely more on general observation as to the nature of the corruption when tracking down the cause.

Intermittent, Variable Location

If the corruption happens in varying location, and at unpredictable times, then your debugging options are often limited to making observations about the corruption after it has occurred.

 

Determine the location of the corruption

If the memory corruption is the immediate cause of the bug, such as with an address error due to a corrupt pointer, then you may be able to immediately determine the effect of the corruption simply by seeing what address was being accessed at the time of the bug.

If the memory corruption is an intermediate cause, then you will track down the address of the corruption in the process of analyzing the immediate cause of the bug, and any intermediate causes that lie between the root cause and the symptoms.

Hardware Breakpoints

If your target platform has some kind of break-on-access breakpoint, then use this as your first line of investigation when debugging memory corruption with a know address. Simply set the debugger to execute a breakpoint when a memory location changes, then when the location happens, see what code is executing.

This technique can work very well if the location that is being corrupted contains data that is relatively static. However, if the location contains some dynamic variable that changes hundreds of times per frame, then you may have some difficult in finding the single write access that is causing the problem.

In that case, you may be able to augment your write-access breakpoint with a conditional check that verifies that the data being written to the corrupt location is in the valid range.

Sometimes memory is corrupted with vales that are within the valid range, but nonetheless are wrong. Your options here are more limited:

 

- Repeatedly run the code, and each time the breakpoint trips, look at the call stack until you see something that you do not recognize as code that can legally write to this location.

- If the legal places that write to this location are known and relatively limited, then update them to first write to some separate location. First ensue the corruption does not also affect that separate location, and then update the breakpoint condition to check the value written matches the stored value.

Try tracing through the code, stepping over functions, when you find one that does the corruption, and then next time around, dig into that function.

If you don’t have this debugger functionality, then you can still roll your own by writing a little function that checks that memory location.  You can then sprinkle calls to this function through your code, narrowing down the region of code that causes the error.   If the meory location’s address is dependent on the code, you may need to compile the code, note the new address, and then re-compile with the new address wired into the code. 

Another manual method is to keep track of allocations that include that particular location.   Memory corruption is often due to a dangling pointer that was once legal.  So if you know the location of corruption in advance, then having a list of the callstacks of all the allocations that once owned that location can help quickly identify the culprit.

Identifying Hex Droppings

A memory location has been corrupted. Assuming you cannot quickly find what bit of code is responsible for the corruption, you can learn a lot about what that piece of code might be by examining the nature of the corruption.

Once you have identified the location that has been corrupted, then look at a hex dump of it in the debugger (or print out your own if a debugger is not available). A hex dump looks something like this.

 

0x00322B90  fd fd fd fd ab ab ab ab  ýýýý««««
0x00322B98  ab ab ab ab ee fe ee fe  ««««îþîþ
0x00322BA0  00 00 00 00 00 00 00 00  ........
0x00322BA8  12 00 0d 00 22 07 18 00  ...."...
0x00322BB0  48 2b 32 00 40 2c 32 00  H+2.@,2.

 

The memory address is on the left, then comes the contents of memory, here listed eight bytes to a row, and then those eight bytes are repeated as ASCII characters

Single Bit Corruption

Few pieces of code will cause only a single bit to be flipped. The most likely candidate is a bit-field of flags.

Single Byte Corruption

If only one byte was modified, then that can narrow down the fields considerably. If the corrupt value is 0 or 1, then perhaps it is a byte flag.

Single 32-bit Word Corruption

A 32 bit word is often the fastest and most convenient way of storing data. It is the only way for certain data types such as floats or pointers (depending on your platform). Looking at the contents of the 32 bits will tell you something about the code that inserted that value there.

If you know that a 32 bit value is being corrupted, then you should view the memory location as a single 32 bit word, rather than as a sequence of four bytes. This removes any confusion with endianness, and makes the type of data much easier to recognize.

That said, it is also a useful skill to be able to recognize certain types of data as a byte stream, since the data may be intermingled (in a class) with other data of varying types. In the examples below we give the values both as a 32-bit integer, and as a four byte little-endian format, which harder to recognize than big-endian, since that is just the word with the bytes spread out.

Zero

Example:

00000000   or   00 00 00 00

Zero is easy to recognize. At first, you might not think there is much information in a zero, but consider the limited number of reasons a piece of code could be writing a single zero to a location in memory, and it may give some clue as to what piece of code might be responsible.

 
Zero is:

 

NULL – Perhaps the errant code is clearing a pointer? Some programmers make a habit of cleaning any pointer that is a member variable after they have deleted whatever it was pointing to (a reasonable practice to help prevent dangling pointers). However they might be doing it at the wrong time.

 

Zero. Both as an integer (0) and as a floating point (0.0f). Where in the code are individual values set to zero?

 

FALSE. Perhaps the code is treating the location as a flag, and simply setting it to FALSE.

 

The first value in an enum – Perhaps a type field, of a status field. What kinds of enumerations do you have in your suspect code? What does the first entry mean? What causes the code to write out the first value?

Clear and empty - Often data structures are initilized to zero.  Does this happen anywhere in the code?  Does the size of the data being cleared match the zeros in the corruption?

One

Example

00000001 or 01 00 00 00

 

One is also easy to recognize. Less common that zero, it can still tell you something about the code that wrote it there.

 

One is

 

TRUE – Perhaps it is being used as a flag. What could be set to TRUE?

 

An integer – Hence it’s not a floating point number. You can discount code that stores floats.

 

Not a pointer – Odds are that the code causing the corruption is not thinking that it is storing a pointer, unless it is a secondary bug.

 

The first value of an enum – like any small number, it’s possible it is an enumerated value, possibly a type number.

 

Floating Point Numbers

 

Example

3F800000 or 00 00 00 80 3F

 
Many floating point numbers have an easily recognizable format. A very common floating point value is the one shown above, 3F800000 is the hex representation of the 32-bit floating point value of 1.0. See Table 24.X for additional values.

 

Table 24.X

Float			Int

0.00000000		00000000
0.50000000		3F000000
1.00000000		3F800000
-1.0000000		BF800000
2.00000000		40000000
100.000000		42C80000
0.33333334		3EAAAAAB
3.14159274		40490FDB

 

Notice how the small values start with a 3. A floating point number has the first bit being the sign, the next eights bit being the exponent, and the following 23 bits being the fractional part. Since numbers in the same range tend to have similar exponents, you can often recognize a group of floating point numbers of similar magnitude.

In games, a very common range for floating point values is from -1.0 to +1.0. These numbers are used extensively in unit vectors, transformation matrices, UV coordinates and scaling factors. Numbers in this range usually start with a 3 (for positive numbers), or a B (for negative numbers).

If you suspect it is a floating point number, you can then sometimes tell if it is an original (hard wired in the code) value, or a value arrived at by calculation. Consider the numbers above. The values 1.0, 2.0, 0.5, 100.0 all have trailing zeros in their hex representation. The value 3.3333334 also a sequence of AAAAAA in it.

By contrast, the less rational number 3.14159274 has what seems to be a random string of hex digits. We can see the degree of entropy in the hex number matches that in the floating point number.

So, a floating point number that has been the subject of some computations is much more likely to have random looking hex digits. Hence, uou can tell if you are looking for code more like from an update function:
 

p->m_speed = sqrtf(p->m_speed*p_m_speed - 2.0 * g * h);

or from an initialization function

p->m_speed = 2.5f;

 

Small Integers

 

Small integers (in the range 0 to 10000) are usually counters or enums. If you see the value incrementing or decrementing evenly, then that indicates a counter.

If you see it oscillate between a few fixed values, then it is probably some kind of state variable.

Does this small integer seem to match anything in the game at the time of corruption? Some possibilities:

Score
Health
Lives
Level number
Weapon number
Button Pressed

Try to find some correlation between what is going on the game, and the value of corruption.

Large Integers

 
As numbers get larger, the number of uses for them decrease. It’s unlikely that you will be managing groups of over 100,000 items. If you have a large integers that look like thye are counting, then you should consider what it might be counting.

Consider then if it might actually be a pointer, or a code address, and not an integer value at all.

Negative Integers

Example:

FFFFF3A2 of A2 F3 FF FF

Negative integers start with ‘F’s rather than ’0′s.

Integers are generally used for counting things. If you have a negative integer, then that greatly narrows down the range of things it might be used for.

Some code uses the negative form of an integer as a single kind of flag to change the behavior of the code, avoiding the need to have an additional flag.

Negative numbers are also sometimes used as error codes. Some functions take a pointer as a parameter, and then return the error code in the location pointed to by the pointer. If the pointer is incorrect, that will lead to memory corruption with a negative number.

 

Magic Hex Numbers

 

Example

DEADBEEF or EF BE AD DE

 

A magic hex number in the context of debugging is a hex value that has been specifically chosen by the programmer to be visible in the debugger.

The numbers are also chosen so that using it inadvertently will maximize the chances of that use causing an error, and hence alerting the programmer to the illegal usage.

The most common use is in initializing a block of memory to certain values both when it is allocated and when it is freed. This both makes the block visible in the debugger (in the memory window), and also fills it with values that the programmer should notice if they are used either before the memory has been initialized correctly, or if the memory continues to be used after it has been freed.

 

Common Magic Hex Numbers are:

CCCCCCCC
CDCDCDCD
DEADBEEF
DEADDEAD
DDDDDDDD
FDFDFDFD

 

Use of magic numbers varies by platform. Often developers use their own magic numbers, and they tend to prefer those that can be read aloud, such as DEADBEEF.

 

Magic ASCII

 

Example:

474E5089 or 89 50 4E 47 or ‰PNG

 

Frequently asset files are identified by a four byte (partially) ASCII string that indicates the file type in some human readable way. It’s quite unlikely that this will find its way into a single word corruption, but it’s worth looking in the ASCII column in the memory window, just to check if this is the case, since if you recognize this, it should hopefully point you directly at the culprit.

 

Pointers

 

Example

00434150 or 50 41 43 00

 
Your program usually occupies a relatively small amount of the available four gigabyte address range of a 32-bit pointer. Hence, pointers usually fall within a recognizable range.

Under Win32, your executable starts at address 00400000 (4MB from the start of it’s virtual address space) so function pointers, and pointers to static data will often start with 004 (and 005, 006 etc as your program increases in size).

On the PS2, your executable start at 00100000 (1MB), so pointers will start with 001, 002, etc.

Function pointers are an unlikely candidate for corruption data, so if you see a pointer like this, it’s more likely a pointer to some static data.

The most common type of pointer to static data that is passed around is a pointer to a string. If it looks like you have a pointer in your corruption data, then try following it and see if it points to a recognizable string.

Depending on your platform, pointers may be more likely to be word aligned. On the PS2, pointers to code or any word sized data must be word aligned. The PC allows all data referencing at the byte level.

 

Random Numbers

 

Example

9D29F113 or 13 F1 29 9D

 

When you look through the memory occupied by your game, you will find surprising little data that looks random. There are usually lots of zeros, and where the data is more closely packed, certain bytes or patterns predominate.

So when you find a number that looks random, it almost certainly has some meaning. Here are some of the things it could be.

 
A floating point number – as mentioned previously, a floating point number with several significant digits will look kind of random. The constant pi (3.141592654) comes out as 40490FDB – which looks random.

 
A checksum – if your code uses a checksum, such as CRC32, for some reason, such as identifying assets, then this could be a stray one. If you have the capability, then try seeing what string generates this checksum.

 
Compressed data – well compressed data should look random. It’s unlikely that it would end up in a single word of corruption, but possible.

 
Text – It looks random at first sight, but if the bytes are mostly in the range 0×30 to 0x7F, then it is quite possible that it is a fragment of a string. See what it says in the ASCII column of the memory window.

 

Block Corruption

 

Block corruption is where a group of words in memory are corrupted more or less together. The block can be any size, but we are generally talking anything from four bytes to 1024 bytes.

The corruption data in the block may contain any combination of the types of corruption data found in a single word, as discussed previously. There are a few situations specific to block corruption.

 

Partial corruption

 

When the data in the block of memory covered is not entirely corrupt, just say every few bytes or words has been changed, then this is a good indication that we are dealing with a pointer to a data structure (a structure or class) that has gone astray.

The most likely explanation is a dangling pointer. The code is continuing to update some data structure that has already been freed.

Full corruption

 

If the block of corruption is contiguous and no byte within it remains unchanged (except for a few common bytes, like zero, that might exist frequently in both corrupt and correct data), then it seems like the data structure has either been initialized, reset, or copied from somewhere else.

 

Unit Vectors

 

A common arrangement of three floats is in a vector, and a common sub-group of vectors is the unit vector. Unit vectors are quite recognizable in memory, since they consist of three small floating point numbers (in the range -1.0 to +1.0), and so they frequently start with the hex digit 3 or B.

Here’s an example of a unit vector sitting incongruously in the middle of a string.

 

5c6b6369 73636f64 6d61675c 6e697365  ick\docs\gamesin
3e6fdb1a bd0ee1b0 3f7909cd 6f635c6b  .Ûo>°á.½ Í.y?k\co
655c6564 706d6178 5c73656c 6d617865  de\examples\exam

 

Looking at the hexadecimal, it is not immediately obvious that anything is wrong. We can see however from the text display column that there seems to be some garbage bytes in the middle of the path name.

Looking more closely at the garbage bytes, viewed as words, we can see that two of then start with 3, and one starts with B – a very good indication that we are dealing with a vector of small numbers, possibly a unit vector.

 

We can then switch to a floating point view, which gives us:

 

2.6502369e+017  1.8019267e+031  4.3599426e+027  1.8062378e+028
0.23423424      -0.034883201    0.97280580      7.0364824e+028
6.5049435e+022  2.9386312e+029  2.7403974e+017  4.3612297e+027

 

This confirms the nature of the corruption. We have three floating point numbers in the range -1.0 to +1.0, we can do a quick calculation to confirm that if we square the numbers and add them it comes out at about 1.0, so the length of the vector is 1.0, a unit vector.

 

Causes and Effects of Corruption

Once you have determined the likely nature of the corruption, you need to identify the piece of code that caused the corruption. If you are not able to directly observe the corruption taking place, you may have to selectively instrument suspicious pieces of code.

To narrow down the field of pieces of code that might be considered, we should have a look at the most common direct causes of corruption, and examine how each cause manifests itself.

Buffer Overruns

A Buffer overrun is perhaps the most common type of bug. You often hear about “buffer exploits” in the hacking world. Here a programmer has neglected to check that the size of the input data fits into the destination space. The data overruns the buffer, and possibly overwrites some space used for code. By adding some appropriate code to the end of the data, an industrious hacker can inject some of his own code into an application and take control of it.

Buffer exploits are less of a security problem for game developers, unless they are accepting data over the internet. However buffer overflows are still a very significant cause of bugs.

 

 

Bad Pointers

 

If the value of a pointer is incorrect, then it can corrupt memory (as well as providing bad data to whoever uses that pointer). The value in a pointer can become “bad” in a number of ways.

 

Dangling Pointer – If a memory block is de-allocated or freed, yet some pointer still references that block (or an object within that block), then that pointer is said to be a “Dangling Pointer”. The value of the pointer has not changed, however the pointer has become bad since it no longer points to valid data.

 

Incorrect Pointer Calculation – The pointer could be generated using incorrect pointer arithmetic, or using other values that are themselves incorrect, causing the value of the pointer to be calculated incorrectly. Pointer arithmetic might also return a pointer out of range of the target buffer – a form of buffer overflow.

 

Corrupt Pointers – The memory in which the pointer is stored may itself have become corrupted due to some unrelated cause. Thus corruption can cause corruption, extending the chain of causes.

 

Bad Local Pointers

If a pointer is created to an object that has local scope, then that pointer will only be valid while that object is in scope. See Listing 1

 

Listing 1

void CheckThing(CThing *p_x)
{
  CThing p_thing;
  p_thing = *p_x;
  if (ThingCheck(p_thing))
 {
    AddToList(p_thing));
 }
}

 

Here a local variable p_thing is being used for some temporary purpose. However, during the course of the function the variable is added to some global list, then the function returns.

The result is that there is now a pointer in some list somewhere that points to memory that is used by the stack. This will not be an immediate problem, since when the function returns, then the stack pointer will recede higher in memory, leaving the instance of p_thing safely below the stack. Then one of two things might happen.

Object gets corrupted – the object pointer to by p_thing now no longer legally exists, however its binary image is still in memory, and code can continue to use it without problems until the stack once more descends below that location in memory. At that point the object may get corrupted. This, in a sense, is not a memory corruption bug, since the writes are legal, and in the correct place. But it behaves very like a corruption bug.

Stack gets corrupted – the object is in a list, and presumably some operations are going to carried out with it. When the stack descends past this point in memory, then if that object is updated via the list, then updating the object will corrupt some memory that is legally being used by the stack. This could be a return address, it could be a saved register value, or it could be local variables in some routine higher up the call stack. Whichever it is, the effects will be deferred until the function call stack returns to that point, which could be quite distant from the cause of the problems.

 

Stack overflow

The stack overflowing can cause memory corruption in a number of ways. A stack is of a fixed size, and immediately it overflows that size it has begun to corrupt memory. What happens next depends on the size of the stack frame, the position of the stack in memory and what lays beneath it in memory.

Not all platforms are equally vulnerable to corruption for stack overflow. Win32 will simply raise a stack overflow exception if the stack pointer writes beyond the bounds of the stack. On platforms that do this, debugging is a relatively simply matter of looking at the call stack, which should have one or two functions repeated over and over, pointing you directly at the culprit.

Other platforms are less fortunate. The PS2 has not special protection for the stack pointer. The stack is frequently placed at the top of the 32MB of memory, which means that if it grows downward past the area reserved for it, it can corrupt data and possibly even code.

Code Corruption

 

As already mentioned, if the stack is allowed to proceed apace through memory, it will eventually overwrite the code that is currently being executed, causing a crash.

Code corruption due to stack overflow can often be seen in the disassembly view of the debugger. If the code before the crash location looks reasonable, and the code after looks repetitive or contains illegal instructions, then overflowing stack corruption is the most likely cause.

 

Sparse Corruption

 

The common cause of stack overflow is runaway recursion. A less common cause is moderate recursion combined with a very large stack frame. This occurs when the programmer has a local variable in the recursive routine that takes up a large amount of space. Example: See listing 2

 

Listing 2

class	CBuffer
{
int x[2048];
}

int DigTree(CTree *p_tree)
{
 CBuffer local_buffer;
 Dig(p_tree,local_buffer);
 if (NotFinished(p_tree))
   DigTree(p_tree);
 Finish(p_tree,local_buffer);
}

 

Here DigTree is a recursive function that has a local variable local_buffer. A new instance of local_buffer must be created on the stack. Since the CBuffer class takes 8K of memory, it takes relatively few recursions to overflow the stack, especially on consoles such as the Gamecube where the stack size is kept as low as possible, often down to 64K or less.

If the huge CBuffer object is not cleared every time the function is entered, then this can have the effect of corruption being evenly spaced every 8K through memory. This can be quite a red herring in a number of ways. Firstly, the first time you see the corruption it might seem to be just a single instance of corruption, making you not think of a stack overflow.

Secondly, it can overstep any tests you do to check for stack overflow. Often you would place some magic numbers in the bottom of the stack, and check to see if they are still there, as a way of detecting the stack has overflowed. If the game does not immediately crash during the recursion, the stack pointer will return to a normal address, and your code will run along merrily until the corruption causes some later problem.

Thirdly, if it is runaway recursion, then the large stack frame might overstep the code that is being executed, causing widespread code corruption, yet not actually crashing in the code that caused the corruption. While this should still point you to stack overflow, the culprit will be less obvious. A stack analysis on the corrupting stack frame will indicate the location of the corrupting code – providing you can detect the write that causes the corruption.

August 18th, 2008

CPanel Hotlink Protection Breaks WP Permalinks

I was messing with the hotlink protection in CPanel, and just toggled it on and off, unfortunately this broke the permalinks in wordpress. Specifically it removed the line:

RewriteEngine On

In .htaccess

Took me a while to track down.

With wp-supercache, the mod_rewrite part of your .htaccess should look like:

<IfModule mod_rewrite.c>
# BEGIN WordPress

RewriteEngine On
RewriteBase /
#RewriteCond %{QUERY_STRING} !.*s=.*
#RewriteCond %{HTTP_COOKIE} !^.*comment_author_.*$
#RewriteCond %{HTTP_COOKIE} !^.*wordpress.*$
#RewriteCond %{HTTP_COOKIE} !^.*wp-postpass_.*$
#RewriteCond %{HTTP:Accept-Encoding} gzip
#RewriteCond %{DOCUMENT_ROOT}/wp-content/cache/supercache/%{HTTP_HOST}/$1/index.html.gz -f
#RewriteRule ^(.*) /wp-content/cache/supercache/%{HTTP_HOST}/$1/index.html.gz [L]

#RewriteCond %{QUERY_STRING} !.*s=.*
#RewriteCond %{HTTP_COOKIE} !^.*comment_author_.*$
#RewriteCond %{HTTP_COOKIE} !^.*wordpress.*$
#RewriteCond %{HTTP_COOKIE} !^.*wp-postpass_.*$
#RewriteCond %{DOCUMENT_ROOT}/wp-content/cache/supercache/%{HTTP_HOST}/$1/index.html -f
#RewriteRule ^(.*) /wp-content/cache/supercache/%{HTTP_HOST}/$1/index.html [L]
#
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]

# END WordPress
</IfModule>

In my case the “RewriteEngine On” was missing, causing individual pages not to work, but the main page worked fine, so I did not notice.