Practical Hash IDs « Cowboy Programming

January 4, 2007

Practical Hash IDs

Filed under: Game Development,Inner Product — Mick West @ 6:28 pm

This article originally appeared in the “Inner Product” column in Game Developer Magazine, December 2005

Game assets frequently need to be referenced indirectly by the engine, by code and by scripts. The simplest way to do this is by using the name of the object. However this is very inefficient and otherwise problematic. The article explains the practice of using a 32 bit hash of the name as a unique ID and explores some ways of integrating this into your code and tool pipeline.

PREMATURE OPTIMIZATION?

This article is about an optimization. Optimizations are typically done at the end of a project. Donald Knuth famously quotes Tony Hoare saying: “premature optimization is the root of all evil”. But he prefaces this with an important caveat: “about 97% of the time”.

This optimization is one of those 3% (or more, opinions vary) best done early. It’s such a fundamental change that it affects many aspects of the game engine code and data structures. It’s not an optimization you do at the end to speed things up, but rather one you do at the start, to enable you to do more right from the early stages of development.

THE NEED FOR ASSET IDS

A game contains a very large number of assets. Meshes, textures, animations, skeletons, scripts, sounds, effects, triggers, events, NPCs, and various other miscellaneous resources, objects and entities.

These assets all need some way of being referenced, either as part of some internal linkage (a mesh will reference the textures it needs), or as some intentional reference on the part of the designer (“play the Man_jump_01a animation”, “blow up building City04_Bank01”).

These references need to be resolved at run-time, as this gives you a much more dynamic development environment than statically linking everything together at compile time. Assets can be changed independently and re-loaded. Small sections of the game data can be recompiled quickly without having to recompile everything. (See Sean Barrett’s “Late Binding Data”, Game Developer, Feb 2005.)

The assets will nearly always have a name. Textures have the file name of the original .PSD, .PNG or .TIFF. Scenes exported from 3D Studio Max or a similar package will have a name associated with each object in the scene.

So the simplest way of referencing an asset is to just use the string containing the name. This has a great advantage in that it’s very straightforward. However there are several disadvantages.

PROBLEMS WITH STRINGS

Firstly, strings take up a surprising amount of memory. Asset names are often quite long, especially if paths are included. So if every mesh has to reference every texture it uses by name, then that can add up to an unnecessarily large amount of memory.

Secondly, it’s slow. Comparing two strings takes a lot longer than comparing two numbers. Quite often asset names will have long common prefixes, meaning a comparison has to walk through many matching characters before it finds a difference when you are looking up the string in a table. Plus the length of the string means you need to perform a large number of memory accesses, which can be very slow on architectures with expensive data reads, such as the PS2.

Thirdly, it does not fit into data structures nicely. The strings are of varying length, making it difficult to form neat data structures around them.

Fourthly, having a large number of strings visible in the data might be something of a security concern. Plot spoilers might be deduced from the names of assets (“Hey, what’s this ‘MDL_SPIDERWOMAN’? Sweet!”). And if you want to stop people hacking left-over content in your game (like the infamous “hot coffee” mod in GTA-SA), then you’re better off not giving them a load of clues.

USE THE HASH VALUE

So, instead of using the string itself, one approach is to use a hash value of the string. As you probably know, a hash value is a number calculated by mangling the bits of the string together using some algorithm to produce a fixed number of bits. There are various hash algorithms that produce various sized outputs. Algorithms such as SHA, MD5 or RIPEMD-160 produce hash values of 128 or 160 bits in length (16-20 bytes) and so don’t really help us very much.

The string hash algorithm that best fits our needs here is CRC-32. CRC-32 takes a string and gives you a 32-bit value. The algorithm is specifically designed so that small differences in a string produce very large differences in the output. This is very important, since you don’t want strings such as “ANIM_10002” and “ANIM_20001” to result in the same hash value (as they would if you did a traditional checksum where you added or XORed the bytes or words together).
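To make the comparison concrete, here is a minimal sketch of a bitwise CRC-32 next to a naive additive checksum. It assumes the common reflected CRC-32 variant (polynomial 0xEDB88320, as used by zlib and PNG), since the article does not say which variant was used. The function names are illustrative only.

```cpp
#include <cstdint>

// Bitwise CRC-32, reflected polynomial 0xEDB88320 (assumed variant).
// Short rather than fast; a table-driven version returns the same values.
inline std::uint32_t crc32(const char* s)
{
    std::uint32_t crc = 0xFFFFFFFFu;
    for (; *s; ++s) {
        crc ^= static_cast<unsigned char>(*s);
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return ~crc;
}

// Traditional checksum for comparison: just add the bytes together.
inline std::uint32_t additive_checksum(const char* s)
{
    std::uint32_t sum = 0;
    for (; *s; ++s)
        sum += static_cast<unsigned char>(*s);
    return sum;
}
```

With these, additive_checksum("ANIM_10002") and additive_checksum("ANIM_20001") are equal because the same bytes are merely reordered, while the two crc32() results differ.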

Using the hash in place of the string immediately removes the problems I mentioned above. Comparisons are now very fast, since you only need to compare two 32 bit values. Memory accesses are reduced to a bare minimum. Data structures can now have a nice neat single 32-bit word to store the reference rather than a variable length, memory hogging string.

Using a hash has other benefits. Since it’s a hash value, you can use it as a very fast index into a hash table simply by lopping off however many bits you need: for a 4,096-entry table, just use the bottom 12 bits. Or even easier, you can pick 12 bits out of the middle of the word to match the power-of-2 alignment that your table is on (so, if your hash table index is 8 bytes per entry, then just mask off bits 14 through 3, and you’ve got the offset into your hash table index).
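The two masking options can be sketched as follows. The constants and function names are assumptions chosen for illustration: 12 index bits give 4,096 slots, and the entries are taken to be 8 bytes each.

```cpp
#include <cstdint>

constexpr std::uint32_t kIndexBits = 12;               // 2^12 = 4,096 slots
constexpr std::uint32_t kSlotCount = 1u << kIndexBits;
constexpr std::uint32_t kEntrySize = 8;                // bytes per index entry

// Slot number: keep the bottom 12 bits of the hash.
inline std::uint32_t slot_from_hash(std::uint32_t hash)
{
    return hash & (kSlotCount - 1u);
}

// Byte offset into a table of 8-byte entries: keep bits 3 through 14 in
// place, so the result is already a multiple of 8 and needs no shift.
inline std::uint32_t offset_from_hash(std::uint32_t hash)
{
    return hash & ((kSlotCount - 1u) << 3);            // mask 0x7FF8
}
```

Both functions address the same number of slots. They simply take their 12 bits from different parts of the hash, so the second saves a multiply or shift when the table is addressed by byte offset.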

Another useful technique can be used when game assets need to have pointers to each other. By using a 32-bit hash you can write out a data structure identical to that used in the game, but with the hash values in the memory location usually occupied by the pointers. Then when the asset needs to be loaded and bound into the game, you simply replace the hashes with the appropriate pointers. This removes the need for building data structures from parsed data, and can greatly speed up the loading process, while still giving you the flexibility of late-binding data.
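A minimal sketch of that load-time patching is shown below. Texture, TextureRef and TextureRegistry are hypothetical types, not taken from any particular engine. The article assumes 32-bit pointers, so that a hash and a pointer are the same size. On a 64-bit target the union below simply takes the size of the pointer.

```cpp
#include <cstdint>
#include <unordered_map>

struct Texture { int width; int height; };

// One slot, two meanings: as written to disk it holds the 32-bit hash of the
// texture name; after binding it holds a pointer to the loaded texture.
union TextureRef {
    std::uint32_t hash;
    Texture*      ptr;
};

// Hypothetical registry of everything currently loaded, keyed by name hash.
using TextureRegistry = std::unordered_map<std::uint32_t, Texture*>;

// Replaces the stored hash with the matching pointer. Returns false and
// leaves the hash untouched if that texture has not been loaded yet.
inline bool bind_texture(TextureRef& ref, const TextureRegistry& loaded)
{
    const auto it = loaded.find(ref.hash);
    if (it == loaded.end())
        return false;
    ref.ptr = it->second;
    return true;
}
```

A loader would walk every reference slot in a freshly loaded block and call bind_texture() on each, after which the data is used exactly as if it had been built with pointers from the start.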

FEAR OF COLLISIONS

So why don’t people use hashes for IDs? Well the first thing people think when presented with this idea is “what if there’s a collision?”

A collision is when two strings have the same hash value. Since a 32-bit hash has only about 4 billion possible values, if you throw enough strings in you will eventually find two that have the same hash value.

The probability that two given strings will have the same hash value is 1 in 4,294,967,296 (about 1 in four billion). With a larger number of strings, the probability rises. At about 77,000 strings there is an approximately even chance that there will be a collision, meaning at least one pair of distinct strings will have the same CRC-32 hash value.

For a practical example, I did a check for collisions on 89,000 unique strings culled from some random dump of a dictionary, and there was only one collision. The strings “bellys” and “cochleariform” both have the same CRC32 (0x5D1F90F6). Later when checking a list of 216,000 strings, I only got seven collisions.

Remember there is a big difference between finding a collision somewhere in a large set of strings, and the possibility of a new string causing a collision. If you’ve already got 89,000 strings with no collisions, and you add one more, the chances that there will now be a collision are 89,000/2^32, or about 1 in 50,000.

So depending on how many IDs you actually use, you are going to have to deal with collisions very rarely.
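These odds are easy to check numerically. The sketch below uses the standard birthday approximation, which assumes hash values are spread uniformly at random. Real asset names and CRC-32 only approximate that, and the function names are illustrative.

```cpp
#include <cmath>

// Approximate chance that n random 32-bit hashes contain at least one
// collision (birthday approximation: 1 - exp(-n(n-1) / (2 * 2^32))).
inline double collision_probability(double n)
{
    const double space = 4294967296.0;  // 2^32
    return 1.0 - std::exp(-n * (n - 1.0) / (2.0 * space));
}

// Chance that one newly added hash matches any of n existing distinct hashes.
inline double new_string_collision_probability(double n)
{
    return n / 4294967296.0;
}
```

Under this approximation the even-odds point falls at roughly 77,000 strings, and adding one string to 89,000 existing ones gives a chance of about 1 in 48,000, in line with the rough 1 in 50,000 figure.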

DETECTING AND HANDLING COLLISIONS

But collisions will occur, and at some point you will need to detect collisions, and then deal with them.

To detect collisions, you need to keep an automated central database of all the strings and the hashes that are used. This database is solely used during development, so won’t require any overhead in the actual game. It need not be complex, just a simple list will usually suffice, and you don’t need to worry about removing unused entries – just purge it whenever you do a full rebuild.
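Such a database can be as small as a map from hash to name. The following is only a sketch: HashRegistry and its methods are hypothetical names, and a real pipeline would also persist the list between tool runs.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Development-only record of every (hash, name) pair the tools have seen.
class HashRegistry {
public:
    // Records the pair. Returns nullptr if all is well, or a pointer to the
    // previously registered name when a different string already owns the hash.
    const std::string* add(std::uint32_t hash, const std::string& name)
    {
        const auto result = names_.emplace(hash, name);
        if (!result.second && result.first->second != name)
            return &result.first->second;
        return nullptr;
    }

    // Forget everything, for example at the start of a full rebuild.
    void purge() { names_.clear(); }

private:
    std::map<std::uint32_t, std::string> names_;
};
```

The asset compiler would call add() for every name it hashes and stop the build with both names in the error message whenever it gets a non-null result.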

When you find a collision, the simplest way to deal with it is to change the string. Now I know that this sounds like a rather hacky solution. But it’s really not going to happen very often. You’ll perhaps have to change a string once a month towards the end of the project, and not at all during the first half of the project, when the number of unique strings is still very low.

USING HASHES IN CODE

The other reason that people don’t use hashes is that they can’t read them. A simple implementation of this system would involve someone running a CRC32 program on the string, and then manually typing the checksum into the code where the asset was being referenced. Instead of typing LoadAsset("MDL_ENEMY_GUNSHIP"), you type LoadAsset(0x23C55C3A). Nasty.

Ideally the user (in this case the programmer or designer) should never have to see these hexadecimal strings. The entire process should be automated to make it both easier to read, and less prone to errors. (I’ve cut and pasted the wrong hash value a few times in the past.)

When dealing with compiled data such as meshes, animations, textures etc., this is relatively straightforward. You just compile the strings into hashes when you compile the data into whatever engine specific format you need.

When you need to reference an asset from C/C++ code, the story is somewhat different. Here are four ways of doing this, in the order I first considered using them.

1) Just use the hash values directly – Not a very good option. You’ll type them in wrong, and there’s no easy way of figuring out if it’s the right hash value. You can improve things slightly by adding a comment next to the hash value, but it’s still a messy and fragile method.

2) Use #defines – In this system you add all your strings to a header file (say crc_hashes.h), with a line for each possible string (listing 1).

Listing 1 – a header file containing all the hash values

// crc_hashes.h
#ifndef CRC_HASHES_H
#define CRC_HASHES_H

#define CRC_MDL_MECH01_LARGE    0x4854987A
#define CRC_MDL_MECH02_LARGE    0x374FAAD2
#define CRC_ANM_FOX_JUMP        0xF003C8A8
// add additional crcs here ...
#endif

Then in the code you just use the CRC_ symbols. This initially seems like a reasonable option, but suffers from the huge downside that you have to include crc_hashes.h in every single file that uses a hash value. This effectively means that whenever you add a new hash, you need to re-compile a significant portion of the game.

3) Use a macro that checks the hash value, but is compiled out in “ship” mode (listing 2). With this you enter both the string and the hash value directly into the code (see the example in listing 3).

Listing 2 – a macro for checking your hashes at runtime in debug mode.

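#include <cassert>
#include <cstdint>

typedef std::uint32_t uint32;

// The engine's CRC-32 routine, defined elsewhere.
uint32 crc32(const char *name);

// Asserts that the hand-entered id really is the hash of name.
inline uint32 check_hash(uint32 id, const char *name)
{
    assert(id == crc32(name));
    return id;
}

#ifdef DEBUG
#define HASHCHECK(id, name) check_hash((id), (name))
#else
#define HASHCHECK(id, name) (id)
#endif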

Listing 3 – Showing the different ways of using hash values in code.

// hash value directly:
    Load(0x334BDCF8);
// hash value with inline comment
    Load(0x334BDCF8 /*MDL_MECH01_LARGE*/);
// using #define in a common header file
    #include "crc_hashes.h"
    ...
    Load(CRC_MDL_MECH01_LARGE);
// Using a macro to check values
    Load(HASHCHECK(0x334BDCF8, "MDL_MECH01_LARGE"));
// Using additional preprocessing
    Load(HASH("MDL_MECH01_LARGE"));
// The same line after the custom build step:
    Load(0x334BDCF8 /*MDL_MECH01_LARGE*/);

The HASHCHECK() macro ensures your hash values match your strings. In ship mode it will compile away to leave just the hash value itself. This works well enough, and it’s relatively straightforward to set up a hot key in your editor to automatically generate the hash and the call to HASHCHECK(), and insert them in the code, so you don’t ever have to deal with the hash value directly.

However there are still problems. Firstly you’ve still got the hash value there in the code, leaving you open to cut-and-paste problems. Secondly the check is at run-time, so if it’s an obscure bit of code (maybe it only runs at the end of the game?), then the check might not fire until a few weeks after you add it. Thirdly it’s slow in debug mode since you have to generate a checksum for each string.

As I wrote in another article (Game Developer, Dec 2005), I don’t think there should be separate DEBUG and RELEASE modes, but rather a single DEVELOP mode, used for both purposes. Having the code check the value of a hash value against a string is slow, and it’s done too many times. The hash value might be used several times per frame, but if it’s right once it’s always going to be right. You can hack the macro so it only checks once, but you’ve either got to add another check, or write some self modifying code.
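As a rough sketch of the “add another check” route, a function-local static makes each call site verify its string only once. This assumes a C++11 compiler, and HASHCHECK_ONCE and the counter are hypothetical names introduced here. crc32() stands in for the engine’s own routine and uses the common reflected CRC-32 variant.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the engine's CRC-32 (reflected polynomial 0xEDB88320).
static std::uint32_t crc32(const char* s)
{
    std::uint32_t crc = 0xFFFFFFFFu;
    for (; *s; ++s) {
        crc ^= static_cast<unsigned char>(*s);
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return ~crc;
}

static int g_hash_checks_run = 0;  // only here to make the effect visible

static std::uint32_t check_hash(std::uint32_t id, const char* name)
{
    ++g_hash_checks_run;
    assert(id == crc32(name));
    return id;
}

// Every expansion creates its own lambda, so every call site gets its own
// static. The string is hashed the first time that site runs; later passes
// only test the compiler-generated "already initialised" flag.
// id and name must be literals or globals, since the lambda captures nothing.
#define HASHCHECK_ONCE(id, name)                                        \
    ([]() -> std::uint32_t {                                            \
        static const std::uint32_t checked = check_hash((id), (name)); \
        return checked;                                                 \
    }())
```

The per-call cost drops from hashing the whole string to one flag test, which is the extra check the text mentions. It still does not catch a bad hash in code that never runs.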

4) Use a custom preprocessor. With this approach everything is automated for you: you simply use the string itself and tag it by passing it as the parameter to a pseudo-function called HASH(). You then need to introduce a custom build step to your build process that scans your source for calls to HASH() and replaces them with the correct hash value. (See listing 3 for a comparison of all the methods.)

The great benefit of this system is that the programmer never has to see (or type) another hex string in order to get something in the game. Once the system is set up, the programmer can continue to refer to assets by name, and the optimization of using hash values will take place under the hood. The disadvantage is that you have to add a custom build step that generates intermediate files to be compiled. If you already have such a step it’s probably not such a big deal to extend it to include this. But changing your build process is not a trivial task.
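The core of such a build step can be quite small. The sketch below is one possible shape and does not describe the tool used on any actual project. rewrite_hash_calls is a hypothetical name, and the hash routine is passed in as hash_fn so the same code works with whatever CRC-32 the engine uses. It uses a plain regular expression, so it does not understand comments, string concatenation or HASH() calls split across lines.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <regex>
#include <string>

// Replaces every HASH("NAME") in source with 0xXXXXXXXX /*NAME*/.
inline std::string rewrite_hash_calls(
    const std::string& source,
    const std::function<std::uint32_t(const std::string&)>& hash_fn)
{
    // \b stops the pattern matching inside longer names such as MYHASH(.
    static const std::regex pattern("\\bHASH\\(\\s*\"([^\"]*)\"\\s*\\)");

    std::string out;
    std::size_t last = 0;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(source.begin(), source.end(), pattern);
         it != end; ++it) {
        const std::smatch& m = *it;
        const std::size_t pos = static_cast<std::size_t>(m.position(0));
        out.append(source, last, pos - last);       // text before the call

        char hex[16];
        std::snprintf(hex, sizeof hex, "0x%08X",
                      static_cast<unsigned>(hash_fn(m[1].str())));
        out += hex;
        out += " /*" + m[1].str() + "*/";           // keep the name readable

        last = pos + static_cast<std::size_t>(m.length(0));
    }
    out.append(source, last, std::string::npos);    // text after the last call
    return out;
}
```

A build script would run each source file through this function and hand the rewritten copy to the compiler, leaving the original file with its readable HASH() calls untouched.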

Introducing a custom build step runs the risk of making your code less portable. A new platform might not be as flexible in its compiler options. To get around this, you can define the function HASH() as a function that calculates and returns the checksum of the string passed to it. That way, even if the additional compilation step is missing, the code will still work perfectly (albeit a little slower).

One additional place in your code where the use of these string id hashes needs special consideration is in switch statements. Since the value after a case keyword must be a compile-time constant, and not the return value of a function, using the HASHCHECK() method will not work. A custom preprocessing step will work fine, as the value is automatically inserted and there is no use of macros. However now the fallback method of also defining the HASH() function will not work for switch statements.

OTHER CONSIDERATIONS

By default, CRC-32 is case sensitive. For this use though it is probably a good idea to make it case insensitive. This is especially true if you are going to use hashes in some kind of script environment used by designers. Since you never really want to have MDL_BigJump01 mean something different from MDL_Bigjump01, it makes everyone’s life easier. Making your CRC routine case insensitive also costs very little: fold each character to a single case as it is fed into the CRC, for example with a 256-entry lookup table that maps upper case letters to lower case.
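One way to get a case-insensitive hash is to fold each byte to lower case just before it enters the CRC. The sketch below assumes ASCII names and the common reflected CRC-32 variant, and the function names are illustrative. A table-driven CRC would apply the same fold to the byte before combining it with the running value.

```cpp
#include <cstdint>

// ASCII-only case folding; bytes outside 'A'..'Z' pass through unchanged.
inline unsigned char fold_case(unsigned char c)
{
    return (c >= 'A' && c <= 'Z')
               ? static_cast<unsigned char>(c - 'A' + 'a')
               : c;
}

// Case-insensitive CRC-32: identical to a normal CRC-32 of the lower-cased
// string (reflected polynomial 0xEDB88320, assumed variant).
inline std::uint32_t crc32_nocase(const char* s)
{
    std::uint32_t crc = 0xFFFFFFFFu;
    for (; *s; ++s) {
        crc ^= fold_case(static_cast<unsigned char>(*s));
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return ~crc;
}
```

With this, crc32_nocase("MDL_BigJump01") and crc32_nocase("MDL_Bigjump01") return the same value. The tools and the runtime must agree on the folding, or hashes baked into data will not match those computed in code.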

If you are using hashes to uniquely identify objects in the game, you might want to reserve some for specific system objects. You could do this by adding a few names like SYSTEM_0001, SYSTEM_0002, etc. Or you can simply reserve the first (say) 1024 values (0x00000000 through 0x000003FF), and explicitly report them as illegal values in your database. This might seem a little odd, not allowing this range of values for a hash, but only one out of every 4,194,304 possible strings will have a hash value in this range.
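The reserved-range test itself is a single comparison. In this small sketch the constant and function names are made up for illustration, and the static_assert just restates the arithmetic behind the 4,194,304 figure.

```cpp
#include <cstdint>

constexpr std::uint32_t kReservedIdCount = 1024;  // 0x00000000 through 0x000003FF

// True if a hash lands in the reserved system range and should be rejected
// by the development-time database.
inline bool is_reserved_id(std::uint32_t hash)
{
    return hash < kReservedIdCount;
}

// 2^32 possible hashes divided by 1,024 reserved values is 4,194,304.
static_assert((1ull << 32) / kReservedIdCount == 4194304ull,
              "one hash in 4,194,304 falls in the reserved range");
```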

Given a hash value from a string A (HASH(A)), and given another string B, you can calculate HASH(A+B) by calculating HASH(B) using a starting value of HASH(A). This means you can calculate HASH(A+B) without ever knowing the actual contents of the string A. This is very useful if you have assets that have a series of different extensions or other appended strings (e.g. MDL_ROBOT_01_ might have MDL_ROBOT_01_LEFT_ARM, MDL_ROBOT_01_LEFT_LEG, etc.). You can quickly calculate the hash values of the string with the extensions without having to know the original string.
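A sketch of that chaining is shown below, again assuming the common reflected CRC-32 variant. crc32_append is a hypothetical name, and passing 0 as the previous value starts a fresh hash, the same convention zlib's crc32() uses.

```cpp
#include <cstdint>

// Continues a CRC-32 from a previous result. Undoing the final inversion of
// prev restores the internal state, so hashing the suffix from there gives
// exactly the CRC-32 of the concatenated string.
inline std::uint32_t crc32_append(std::uint32_t prev, const char* suffix)
{
    std::uint32_t crc = ~prev;
    for (; *suffix; ++suffix) {
        crc ^= static_cast<unsigned char>(*suffix);
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return ~crc;
}
```

For example, if only the hash of MDL_ROBOT_01_ is stored in the data, crc32_append(robot_hash, "LEFT_ARM") gives the same value as hashing MDL_ROBOT_01_LEFT_ARM in one go.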

RESOURCES

A description of the CRC-32 algorithm.
http://en.wikipedia.org/wiki/CRC32

A practical, public domain implementation of CRC32.
http://www.csbruce.com/~csbruce/software/crc32.c
