Cowboy Programming – Game Development and General Hacking by the Old West

January 5, 2007

Blob Physics

Filed under: Game Development,Inner Product — Mick West @ 1:57 pm

This article was originally published in the “Inner Product” column in Game Developer Magazine, May 2006

USING VERLET PHYSICS TO SIMULATE BLOBS

Download the blob code and executable: DOWNLOAD (91K)

Games such as Gish from Chronic Logic on the PC and LocoRoco from Sony on the PSP use a 2D physical simulation of a blob as the main character. The physics behind this blob provides the main basis for the gameplay within these games. Since the focus is heavily on gameplay, the actual physics has very little relation to reality, and is not the kind of thing you find covered in books on game physics. This article sets out the basics behind one method of 2D blob physics, and discusses some of the practical implementation issues. I also provide full source code and a demo program for a working blob system.

MASS SPRING SYSTEM

Both games mentioned use a model that has been used for decades: a “mass spring system”. This is simply a collection of point masses connected by a series of springs, roughly in the shape of the object you desire. You can think of it like a particle system, where each particle is attached to some other particles by a number of springs. See figure 1 for a simple example.


Figure 1 – A simple mass-spring system with three masses and three springs

A simple spring connects two points, and has four basic parameters.

1. The rest length, which is the length of the spring when it is neither stretched nor compressed
2. The minimum length of the spring when fully compressed
3. The maximum length of the spring when fully extended
4. The force exerted by the spring, proportional to its displacement from the rest length

Some springs can exert a different force depending on whether they are compressed or stretched. The force can also vary in a non-linear relationship with the displacement, but for our purposes, the simple spring described above works well, and is easy to use.
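The linear spring described above can be sketched in a few lines of C++. This is a minimal illustration with my own Vec2 and SpringForce names, not the article's actual code:

```cpp
#include <cassert>
#include <cmath>

// Minimal 2D vector (hypothetical helper, not from the article's source).
struct Vec2 {
    float x, y;
    Vec2 operator-(const Vec2& o) const { return {x - o.x, y - o.y}; }
    Vec2 operator*(float s) const { return {x * s, y * s}; }
    float Length() const { return std::sqrt(x * x + y * y); }
};

// Linear spring force on point "a", proportional to the displacement of the
// spring from its rest length, directed so as to restore the rest length.
Vec2 SpringForce(const Vec2& a, const Vec2& b, float rest_length, float stiffness) {
    Vec2 to_a = a - b;
    float len = to_a.Length();
    if (len < 1e-6f) return {0.0f, 0.0f};      // coincident points: no defined direction
    float displacement = len - rest_length;     // > 0 stretched, < 0 compressed
    Vec2 dir = to_a * (1.0f / len);             // unit vector from b toward a
    return dir * (-stiffness * displacement);   // pulls back toward the rest length
}
```

The min/max lengths from the list above are handled separately, as hard constraints rather than forces, which is where Verlet integration comes in later.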

A simple point mass has three parameters:
1. The position in space, expressed as a 2D vector
2. Its previous position
3. Its mass

For most of what we are doing, I use a mass of 1.0 for all the point masses. However, it’s useful to have a per-point mass, as it makes it easy to try various effects. If you end up with all the masses being the same, then you can obviously optimize this out of the computation.
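As a sketch, such a point mass might look like this (illustrative names, not the article's source):

```cpp
#include <cassert>

// A point mass as described above: current position, previous position, and mass.
struct PointMass {
    float x, y;     // current position
    float px, py;   // previous position (this will imply velocity under Verlet)
    float mass;     // usually 1.0f, but kept per-point to allow experimentation
};

// A point created at rest: previous position equals current position,
// so the implied velocity is zero.
PointMass MakePointMass(float x, float y) {
    return {x, y, x, y, 1.0f};
}
```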

VERLET MADE EASY

Verlet Integration is a fancy name for a slightly different way of applying the velocity and acceleration of a point to its position. Normally a point P will have a position X and a velocity V. Forces act on the particle, namely gravity, air resistance and the springs. The traditional (non-Verlet) way of updating the position of a particle is to update the position using the current velocity (plus a correction term for the acceleration), and then update the velocity with the acceleration:

F = total of forces acting on this point
T = Time step to update over
X += V*T + F/M/2*T*T
V += T*F/M

This is generally referred to as Euler integration (with a second-order Taylor series correction), but you might recognize it as the standard Newtonian equations of motion (more usually notated as v=u+a*t and s=u*t+1/2*a*t*t). While referring to this as “integration” is technically correct, and will eventually lead to a deeper understanding, don’t worry if you don’t follow what is meant by “integration”; just think in terms of the equations of motion.
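As a minimal sketch (my own function name, not the article's code), the Euler step for a single axis looks like:

```cpp
#include <cassert>

// One Euler step for a single axis, matching the equations above:
// position is advanced with the old velocity plus the acceleration
// correction term, then velocity is advanced with the acceleration.
void EulerStep(float& x, float& v, float force, float mass, float dt) {
    float a = force / mass;
    x += v * dt + 0.5f * a * dt * dt;
    v += a * dt;
}
```

For example, a point at rest under a constant force of -10 (mass 1.0) moves to -5 after one step of dt = 1 and ends with velocity -10, exactly as s = 1/2*a*t*t and v = a*t predict.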

Verlet integration is basically another way of performing this physics advancement step. With Verlet we do not store the velocity, instead we store the previous position, and the velocity is implied as the difference of the current position from the previous position. The physics update then becomes:

F = total of forces acting on this point
T = Time step to update over
X0 is the previous position, X1 is the current position
XT = X1
X1 += (X1-X0) + F/M*T*T
X0 = XT

So why use Verlet? Technically, Verlet integration is more accurate than Euler integration when the forces vary with position and velocity. The reasons why are a little obscure, and for many practical game purposes the difference in accuracy is not a major issue. The main reason for using Verlet is that it makes it very easy to apply constraints to a physical system. When a point moves past a physical limit (such as one point moving further away from another point than the maximum length of a spring that connects them), then we can simply move the point back to a “safe” position within that length. There is no need to calculate an impulse velocity, as the velocity is implied by the position, and is automatically handled by the movement.
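A one-dimensional sketch of the Verlet step and a maximum-length constraint, using my own names rather than the article's source, might look like:

```cpp
#include <cassert>
#include <cmath>

// Verlet step for one axis: velocity is implied by (x1 - x0), so
// integrating is just a position shuffle plus the acceleration term.
void VerletStep(float& x0, float& x1, float force, float mass, float dt) {
    float next = x1 + (x1 - x0) + (force / mass) * dt * dt;
    x0 = x1;  // current position becomes the previous position
    x1 = next;
}

// A maximum-length constraint between two 1D points: if they drift further
// apart than max_len, move each halfway back to a "safe" distance. The
// discarded displacement implicitly removes velocity, so no impulse
// calculation is needed.
void EnforceMaxLength(float& a, float& b, float max_len) {
    float d = std::fabs(b - a);
    if (d > max_len) {
        float excess = (d - max_len) * 0.5f;
        float dir = (b > a) ? 1.0f : -1.0f;
        a += dir * excess;
        b -= dir * excess;
    }
}
```

Note how the constraint simply moves positions; the next VerletStep automatically sees the reduced implied velocity.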

BUILDING A BLOB

Once I got the basic spring-mass system working, I needed to create a blob. I figured that since the natural shape of a body of water is a sphere when subjected to neutral external forces (such as a raindrop, or a blob of water in zero gravity), I should start with a circular spring-mass system, and the application of gravity would then naturally deform it into a nice blob shape.


Figure 2 – A single-skinned blob suffers from an easily folded skin

So my first attempt (figure 2) was a circle of 20 point masses, each joined to its neighbors, and to a center point, by springs. This is a standard N-gon, with the rest lengths of the springs being the natural lengths of the sides of the N-gon. This worked reasonably well for a first pass, and gave me something vaguely blobby that settled into a circle under zero gravity, and deformed a bit when resting on the ground under gravity. But it suffered from several problems:
• The “skin” of the blob (the lines around the edge) kept folding over themselves, leading to ugly spikes in the skin.
• The blob was either too wobbly, meaning the edges of the blob wiggled like a giant piece of Jello, or too stiff, meaning it looked like a rubber ball.
• It kept getting caught on things, the outer edges would wrap around a corner of the environment, and the blob would hang there, looking like a dishrag.

My first attempt at solving this was to give the inner springs (the “spokes”) a longer rest length, so they would be under compression, and to give the outer springs (the “skin”) a shorter rest length. I was thinking this would simulate surface tension. However, this did not work very well: the shape did not improve, and the blob tended to collapse violently if gently nudged.

A BETTER BLOB

So I decided I needed a bit more structure to the blobs to make them more stable. After a few more failed experiments I hit upon the solution. Simply give the blob two layers of skin, one inside the other like concentric circles, joined together with a fairly rigid zig-zag set of joints. The inner skin is joined to a central point as before. See figure 3.


Figure 3 – A double-skinned blob; the double-skin structure provides a very stable skin.

This works remarkably well. I had to tweak the constants a bit, most specifically the number of segments, the thickness of the skin, and the strength of the springs, but quite quickly I had a very realistic-acting blob. See figure 4 for the blobs in action.

Why does this work so well? A “blob” here is a blob of very thick and slippery liquid, something like mercury. Mercury has a very high surface tension, meaning the “skin” of a drop of mercury has very different properties from the interior. I initially thought that the increased tension within the skin structure of our new blob was in some way simulating the effects of surface tension. But after looking at it for a while, I saw that the main effect it was having was constraining the curvature of the skin, thus smoothing out the high-frequency wobbles we saw earlier. It’s simulating the appearance of surface tension rather than the underlying physics.


Figure 4 – The blob physics in action with various blobs. Download the sample to see for yourself.

BLOB PROBLEMS

I encountered three major problems in implementing this system. Firstly, the blobs tended to be unstable, wobbling all over the screen in the absence of external forces. Secondly, the blobs would get stuck, especially against corners, but also against flat surfaces. Finally, the blob edges tended to get twisted when they hit the environment at high speed.

The first problem (instability) struck me as a bit odd, since Verlet integration is known for being a bit more stable than Euler integration. This problem had me scratching my head for a while, but I finally figured out that it was due to the order in which I was performing the updates. I was looping over all the points, gathering the forces for a point (from the springs) and then moving the point.

The problem with that was that when a point moved, the force it applied to another point via a spring would change. In the case of two points connected by a single spring, the force exerted by the spring should be symmetrical for each point. However, if you move one point before gathering the force for the second point, then the forces will be different. This causes the spring system to have a net force imbalance in a particular direction (depending on the order of update). The solution here was very simple. I just split the loop up into two separate loops: one to gather the forces, and then one to apply them in the integration step. This ensured that all forces were symmetrical.
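The gather-then-apply split can be sketched like this (a 1D illustration with hypothetical names, not the article's actual loops):

```cpp
#include <cassert>
#include <vector>

struct Point { float x, px, fx; };       // position, previous position, accumulated force
struct Spring { int i, j; float rest, k; };  // endpoints, rest length, stiffness

// Phase 1: accumulate all spring forces WITHOUT moving any point, so both
// endpoints of every spring see identical geometry and receive equal and
// opposite forces.
void GatherForces(std::vector<Point>& pts, const std::vector<Spring>& springs) {
    for (auto& p : pts) p.fx = 0.0f;
    for (const auto& s : springs) {
        float d = pts[s.j].x - pts[s.i].x;
        float f = s.k * (d - s.rest);    // 1D Hooke's law; sign gives direction
        pts[s.i].fx += f;                // equal and opposite: net force stays zero
        pts[s.j].fx -= f;
    }
}

// Phase 2: only now integrate every point (Verlet, mass = 1 for brevity),
// so no point's movement can skew the forces seen by another.
void Integrate(std::vector<Point>& pts, float dt) {
    for (auto& p : pts) {
        float next = p.x + (p.x - p.px) + p.fx * dt * dt;
        p.px = p.x;
        p.x = next;
    }
}
```

With the two phases separated, the sum of spring forces over the whole system is exactly zero, which is what eliminates the drift described above.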

The second problem (getting stuck) has to do with the way collisions are handled. Since a point’s velocity is implicit in its last movement, a collision resolution that prevents the point from moving very much effectively kills its velocity. For collision resolution of a single point mass to work correctly (i.e. bounce off the surface), the next movement must be of appropriate magnitude and direction so that future movement is correct. However, we are not simulating isolated points; we are simulating a blob, so we need to consider the movement of the system as a whole.

With a spring mass system, the compression of the springs can handle the bouncing (to a certain degree). So if the leading edge points of a spring mass system simply stop when they hit a wall, the springs connecting to the points behind them will be compressed, and eventually bounce the whole blob off the wall in a nice convincing manner.

This works fine for something that just bounces up and down, but something hitting a surface at an angle needs to slide along the wall. This was quite easily accomplished with point/surface collisions by simply allowing the point to move parallel to the wall by the distance it would have originally traveled.

Something similar was done with line/surface collisions, but instead of the points moving parallel to the surface, they move parallel to the line. This allows the blob to slide over corners.

These collision resolutions were also where I implemented friction: simply scaling the distance moved by a constant (like 0.95) gives a relatively pleasing result. You could calculate a friction force to be applied the next frame, but it’s simpler to incorporate it directly into the movement calculation. In the demo the friction can be altered by holding “S” to become slippery, and “A” to become less slippery. Holding “S” will allow you to slip through holes faster.
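A much simplified sketch of this slide-with-friction resolution, assuming an axis-aligned floor at y = 0 (my own simplification and names, not the article's code):

```cpp
#include <cassert>
#include <cmath>

// Resolve a point that tried to move below a floor at y = 0: clamp it back
// to the surface, then let it slide parallel to the wall by the distance it
// would otherwise have traveled, scaled by a friction constant (e.g. 0.95).
void ResolveFloorCollision(float& x, float& y, float old_x, float old_y,
                           float friction) {
    if (y >= 0.0f) return;               // no penetration, nothing to do
    float dx = x - old_x;
    float dy = y - old_y;
    float travel = std::sqrt(dx * dx + dy * dy);  // intended travel distance
    y = 0.0f;                             // snap back onto the surface
    float dir = (dx >= 0.0f) ? 1.0f : -1.0f;
    x = old_x + dir * travel * friction;  // slide along the wall, minus friction
}
```

Because the new position encodes the new implied velocity, the friction scaling automatically slows the point on subsequent Verlet steps with no separate friction force.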

The final problem was edges getting twisted. This generally happened because a point moved too far past another point it was supposed to keep away from. Since the spring constraint only measures distance, the point is then pushed away by the spring, but in the wrong direction, causing the edge to become twisted. Once it’s twisted, it does not become un-twisted by itself.

The simplest solution, and the one I implement in the demo, is to try to never move in large steps. The easiest way of doing this is to run the physics multiple times with a smaller time-step. In the demo I run it six times per frame. Even so, the blobs can get kinks in them if shaken violently.
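The sub-stepping approach might be sketched as follows (hypothetical names; UpdatePhysics stands in for one full gather/integrate/constrain pass):

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Run the whole physics update as several smaller sub-steps per frame
// (six in the demo), so that no point moves far enough in a single step
// to slip past a neighbor and twist an edge.
void StepFrame(float frame_dt, int substeps,
               const std::function<void(float)>& UpdatePhysics) {
    float dt = frame_dt / static_cast<float>(substeps);
    for (int i = 0; i < substeps; ++i)
        UpdatePhysics(dt);  // smaller dt means smaller per-step displacement
}
```

The total simulated time per frame is unchanged; only the maximum displacement per step shrinks, at the cost of running the update multiple times.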

Something that exacerbates this problem is increasing the number of segments in a blob. With a large number of segments, the skin edges are much shorter, and so more likely to exceed their constraints in a single iteration. A lower number of segments works better: I found a 40-segment blob was impossible to kink, and yet it still looked almost as nice as an 80-segment blob that was much more prone to kinking.

Obviously running the physics multiple times is not ideal, as it can be quite expensive. A better solution would be to ensure the kinking does not happen in the first place, perhaps by adding some kind of angular constraint to points on the surface. Another alternative is to link surface points to their second neighbors with a rigid constraint, so if a point gets past its first neighbor, the second neighbor will push it back into the correct position. This type of second-neighbor constraint is commonly found in cloth simulation.
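A 1D sketch of such a second-neighbor minimum-distance constraint (illustrative only, not the article's code):

```cpp
#include <cassert>

// Keep two points at least min_len apart (1D). Applied between each surface
// point and its second neighbor, this pushes a point that has slipped past
// its immediate neighbor back toward the correct side of the skin.
void EnforceMinLength(float& a, float& b, float min_len) {
    float d = b - a;
    float dist = (d >= 0.0f) ? d : -d;
    if (dist < min_len) {
        float dir = (d >= 0.0f) ? 1.0f : -1.0f;
        float push = (min_len - dist) * 0.5f;  // split the correction between both points
        a -= dir * push;
        b += dir * push;
    }
}
```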

RESOURCES

Thomas Jakobsen, Advanced Character Physics, Gamasutra 2001, http://www.gamasutra.com/resource_guide/20030121/jacobson_01.shtml

Erleben, et al, Physics Based Animation, 2005, Charles River Media, Chapter 8, p265.

Chronic Logic, Gish Demo, http://www.chroniclogic.com/index.htm?gish.htm

23 Comments »

  1. Interesting. I note that you wrote an article entitled ‘managed code in games’, Jan 07 (not published on your site yet), where you mentioned that you converted this example to managed code. Are you going to release that code for download as well?

    Comment by Washu — February 7, 2007 @ 1:07 pm

  2. It’s the same code (Managed C++, not C#), all I did was specify the /clr option and toggle the few incompatible optimization options.

    Comment by Mick West — February 7, 2007 @ 1:47 pm

  3. So you set the /clr compiler flag, and adjusted the incompatible compilation settings (/MT -> /MD)…

    That’s really not a valid comparison of managed code performance to unmanaged code performance. While it is true that you will generate IL when you switch to that flag, the IL generated attempts to mirror the patterns and principles that you would use in an unmanaged application, not those of a managed application. If you really wanted to do such a comparison, the least you could have done was to rewrite it in C++/CLI, utilizing the .NET framework and the design principles behind managed applications. As it is, all you’ve really done is demonstrate why you don’t just turn on /clr willy-nilly.

    What really annoys me is that now developers are going to read this article and will use it as a justification for claiming that managed code is slow, even though the performance metric used was not an appropriate one at all. As with any platform comparison unless you write it for the platform such that both samples utilize the platform as it was meant to be used, the results will be highly biased towards whichever platform the sample’s design favors.

    Comment by Washu — February 7, 2007 @ 2:39 pm

  4. Managed code is slow for this type of thing. The code is really very simple: it iterates over a large number of small atomic objects and then performs fairly simple operations on them. That’s all it does. It’s low-level code that needs to be compiled in a way that takes advantage of the hardware. “Utilizing the .NET framework” is not going to help.

    The whole point of my Managed Code article was that certain core functionality inherent in game engines is not suitable for IL/CLR compilation. It’s not fast enough.

    My article is probably viewed as heretical by most game developers. I’m actually encouraging the use of managed code (for non-engine components).

    Comment by Mick West — February 7, 2007 @ 4:39 pm

  5. Managed code is slow for this type of thing. The code is really very simple: it iterates over a large number of small atomic objects and then performs fairly simple operations on them. That’s all it does. It’s low-level code that needs to be compiled in a way that takes advantage of the hardware. “Utilizing the .NET framework” is not going to help.

    As I’ve already pointed out, your comparison is not valid in the slightest. In case you hadn’t noticed, it was forced to convert every piece of the standard library that you touched to IL. It also had to generate a huge number of wrapper and utility functions in order to properly compile your application into IL. Designing the application with managed languages in mind would have avoided all of that and easily increased the performance of the application.

    The whole point of my Managed Code article was that certain core functionality inherent in game engines is not suitable for IL/CLR compilation. It’s not fast enough.

    That’s an interesting point, but it’s only a good point if you can back it up with real evidence. The example you’ve shown doesn’t actually utilize the power of managed code, so it’s not a very good example of what managed code is capable of in those tight inner loops. For example, your code requires the generation of IL for every standard library template instantiation that you’ve used. In the end, the IL generated is not very good at all, being forced to generate large numbers of temporary objects along with other messy problems. In short, your program is far from an optimal managed application in most respects.

    My article is probably viewed as heretical by most game developers. I’m actually encouraging the use of managed code (for non-engine components).

    Yes, and C was also viewed as heretical at one point in time. Then C++ was the next victim. Java didn’t pan out, so it was rightfully regarded as suspect, although even there we can see some large successes in the mobile field. Note the use of scripting languages exploded not too long ago, but even there many regard them as suspect.

    Note that I’m not saying that managed code doesn’t have an inherent amount of overhead to it; however, the performance overhead that your article implies with its frame-rate comparison is extremely ludicrous.

    Comment by Washu — February 7, 2007 @ 6:06 pm

  6. Note that I’m not saying that managed code doesn’t have an inherent amount of overhead to it; however, the performance overhead that your article implies with its frame-rate comparison is extremely ludicrous.

    There are really only three things going on here in my code.

    1 – Iterating over various STL Vectors of various abstract classes
    2 – Calling a virtual function on each abstract class
    3 – Performing a bunch of math and/or some simple collision detection on each object.

    Now, I’ll freely admit I don’t know if the first two could have been implemented in some different “managed” way that would have been faster. Quite possibly, and I’d be happy to hear about it. It’s code like:
    // apply all the constraints
    void CVerletPoint::SatisfyConstraints()
    {
        vector<CVerletConstraint*>::iterator i;
        for (i = mp_constraints.begin(); i != mp_constraints.end(); i++)
        {
            (*i)->Satisfy(this);
        }
    }

    But I think the code that accounts for much of the slowdown is the actual per-object code that is executed, which is pretty straightforward stuff. Looking at the CLR generated asm, you get a lot of memory accesses, and you don’t get the SSE2 optimization (the two things probably being somewhat related, with SSE2 giving you a lot more registers to play with).

    Again, the intent of my article was to encourage people to use managed code. I’m all for it. But the best way is for them to use it where the performance overheads are not an issue. Eventually the balance will shift and more and more code will be managed. But right now, the CLR compiler was just not doing a good job.

    Comment by Mick West — February 7, 2007 @ 9:03 pm

  7. To give you an idea of how bad just compiling it with /clr is, here’s that SAME code as the CLR sees it:
    public private: static void __gc* modopt(System::Runtime::CompilerServices::CallConvThiscall __gc*) CVerletPoint::SatisfyConstraints(CVerletPoint __gc** modopt(System::Runtime::CompilerServices::IsConst __gc*) modopt(System::Runtime::CompilerServices::IsConst __gc*) local6)
    {
    std::_Vector_iterator > __gc* t *> >1;
    CVerletPoint __gc** modopt(System::Runtime::CompilerServices::IsConst __gc*) modopt(System::Runtime::CompilerServices::IsConst __gc*) local3 = (local6 + 28);
    std::vector > __gc** modopt(System::Runtime::CompilerServices::IsConst __gc*) modopt(System::Runtime::CompilerServices::IsConst __gc*) local2 = local3;
    CVerletConstraint __gc*** constraintPtrPtr2 = local2[4];
    if (constraintPtrPtr2 > local2[8])
    {
    ::_invalid_parameter_noinfo();
    }
    *(*static_cast(&t *> >1)) = local2;
    *(*static_cast((&t *> >1 + 4))) = constraintPtrPtr2;
    std::_Vector_iterator > __gc* t *> >2 = t *> >1;
    std::vector > __gc** modopt(System::Runtime::CompilerServices::IsConst __gc*) modopt(System::Runtime::CompilerServices::IsConst __gc*) local1 = local3;
    std::vector > __gc** modopt(System::Runtime::CompilerServices::IsConst __gc*) modopt(System::Runtime::CompilerServices::IsConst __gc*) local5 = (local1 + 4);
    std::vector > __gc** modopt(System::Runtime::CompilerServices::IsConst __gc*) modopt(System::Runtime::CompilerServices::IsConst __gc*) local4 = (local1 + 8);
    while (true)
    {
    CVerletConstraint __gc*** constraintPtrPtr1 = local4[0];
    if (local5[0] > constraintPtrPtr1)
    {
    ::_invalid_parameter_noinfo();
    }
    if ((*(*static_cast(&t *> >2)) == 0) || (*(*static_cast(&t *> >2)) != local1))
    {
    ::_invalid_parameter_noinfo();
    }
    if (*static_cast((*static_cast((*(*static_cast((&t *> >2 + 4))) == constraintPtrPtr1)) == 0)) == 0)
    {
    return;
    }
    if (*(*static_cast(&t *> >2)) == 0)
    {
    ::_invalid_parameter_noinfo();
    }
    System::Int32 __gc* num2 = (*(*static_cast(&t *> >2)) + 8);
    if (*(*static_cast((&t *> >2 + 4))) >= num2[0])
    {
    ::_invalid_parameter_noinfo();
    }
    System::Int32 __gc* num1 = *(*(*static_cast((&t *> >2 + 4))));
    **(*(num1))(num1, local6);
    if (*(*static_cast((&t *> >2 + 4))) >= num2[0])
    {
    ::_invalid_parameter_noinfo();
    }
    *(*static_cast((&t *> >2 + 4))) += 4;
    }
    }

    The equivalent system (mostly, I’m still trying to dig through this horrible mess you call code) in MC++
    public: void __gc* SatisfyConstraint()
    {
    foreach (Kent::VerletSample::VerletConstraint __gc* constraint1 in this->constraints)
    {
    constraint1->Satisfy(this);
    }
    }

    And, just for comparison, the IL generated for each:
    Yours:

    .method assembly static void modopt([mscorlib]System.Runtime.CompilerServices.CallConvThiscall) CVerletPoint.SatisfyConstraints(CVerletPoint* modopt([mscorlib]System.Runtime.CompilerServices.IsConst) modopt([mscorlib]System.Runtime.CompilerServices.IsConst)) cil managed
    {
    .maxstack 3
    .locals (
    [0] std.vector >* modopt([mscorlib]System.Runtime.CompilerServices.IsConst) modopt([mscorlib]System.Runtime.CompilerServices.IsConst) local1,
    [1] std.vector >* modopt([mscorlib]System.Runtime.CompilerServices.IsConst) modopt([mscorlib]System.Runtime.CompilerServices.IsConst) local2,
    [2] int32 num1,
    [3] int32 num2,
    [4] CVerletConstraint** constraintPtrPtr1,
    [5] CVerletConstraint** constraintPtrPtr2,
    [6] CVerletPoint* modopt([mscorlib]System.Runtime.CompilerServices.IsConst) modopt([mscorlib]System.Runtime.CompilerServices.IsConst) local3,
    [7] std.vector >* modopt([mscorlib]System.Runtime.CompilerServices.IsConst) modopt([mscorlib]System.Runtime.CompilerServices.IsConst) local4,
    [8] std.vector >* modopt([mscorlib]System.Runtime.CompilerServices.IsConst) modopt([mscorlib]System.Runtime.CompilerServices.IsConst) local5,
    [9] std._Vector_iterator > t *> >1,
    [10] std._Vector_iterator > t *> >2)
    L_0000: ldarg.0
    L_0001: ldc.i4.s 28
    L_0003: add
    L_0004: stloc.s local3
    L_0006: ldloc.s local3
    L_0008: stloc.1
    L_0009: ldloc.1
    L_000a: ldc.i4.4
    L_000b: add
    L_000c: ldind.i4
    L_000d: stloc.s constraintPtrPtr2
    L_000f: ldloc.s constraintPtrPtr2
    L_0011: ldloc.1
    L_0012: ldc.i4.8
    L_0013: add
    L_0014: ldind.i4
    L_0015: ble.un.s L_001c
    L_0017: call void modopt([mscorlib]System.Runtime.CompilerServices.CallConvCdecl) ::_invalid_parameter_noinfo()
    L_001c: ldloca.s t *> >1
    L_001e: ldloc.1
    L_001f: stind.i4
    L_0020: ldloca.s t *> >1
    L_0022: ldc.i4.4
    L_0023: add
    L_0024: ldloc.s constraintPtrPtr2
    L_0026: stind.i4
    L_0027: ldloc.s t *> >1
    L_0029: stloc.s t *> >2
    L_002b: ldloc.s local3
    L_002d: stloc.0
    L_002e: ldloc.0
    L_002f: ldc.i4.4
    L_0030: add
    L_0031: stloc.s local5
    L_0033: ldloc.0
    L_0034: ldc.i4.8
    L_0035: add
    L_0036: stloc.s local4
    L_0038: ldloc.s local4
    L_003a: ldind.i4
    L_003b: stloc.s constraintPtrPtr1
    L_003d: ldloc.s local5
    L_003f: ldind.i4
    L_0040: ldloc.s constraintPtrPtr1
    L_0042: ble.un.s L_0049
    L_0044: call void modopt([mscorlib]System.Runtime.CompilerServices.CallConvCdecl) ::_invalid_parameter_noinfo()
    L_0049: ldloca.s t *> >2
    L_004b: ldind.i4
    L_004c: brfalse.s L_0054
    L_004e: ldloca.s t *> >2
    L_0050: ldind.i4
    L_0051: ldloc.0
    L_0052: beq.s L_0059
    L_0054: call void modopt([mscorlib]System.Runtime.CompilerServices.CallConvCdecl) ::_invalid_parameter_noinfo()
    L_0059: ldloca.s t *> >2
    L_005b: ldc.i4.4
    L_005c: add
    L_005d: ldind.i4
    L_005e: ldloc.s constraintPtrPtr1
    L_0060: ceq
    L_0062: conv.u1
    L_0063: ldc.i4.0
    L_0064: ceq
    L_0066: conv.u1
    L_0067: brfalse.s L_00b4
    L_0069: ldloca.s t *> >2
    L_006b: ldind.i4
    L_006c: brtrue.s L_0073
    L_006e: call void modopt([mscorlib]System.Runtime.CompilerServices.CallConvCdecl) ::_invalid_parameter_noinfo()
    L_0073: ldloca.s t *> >2
    L_0075: ldind.i4
    L_0076: ldc.i4.8
    L_0077: add
    L_0078: stloc.3
    L_0079: ldloca.s t *> >2
    L_007b: ldc.i4.4
    L_007c: add
    L_007d: ldind.i4
    L_007e: ldloc.3
    L_007f: ldind.i4
    L_0080: blt.un.s L_0087
    L_0082: call void modopt([mscorlib]System.Runtime.CompilerServices.CallConvCdecl) ::_invalid_parameter_noinfo()
    L_0087: ldloca.s t *> >2
    L_0089: ldc.i4.4
    L_008a: add
    L_008b: ldind.i4
    L_008c: ldind.i4
    L_008d: stloc.2
    L_008e: ldloc.2
    L_008f: ldarg.0
    L_0090: ldloc.2
    L_0091: ldind.i4
    L_0092: ldind.i4
    L_0093: calli method unmanaged thiscall void modopt([mscorlib]System.Runtime.CompilerServices.CallConvThiscall) *(native int, CVerletPoint*)
    L_0098: ldloca.s t *> >2
    L_009a: ldc.i4.4
    L_009b: add
    L_009c: ldind.i4
    L_009d: ldloc.3
    L_009e: ldind.i4
    L_009f: blt.un.s L_00a6
    L_00a1: call void modopt([mscorlib]System.Runtime.CompilerServices.CallConvCdecl) ::_invalid_parameter_noinfo()
    L_00a6: ldloca.s t *> >2
    L_00a8: ldc.i4.4
    L_00a9: add
    L_00aa: ldloca.s t *> >2
    L_00ac: ldc.i4.4
    L_00ad: add
    L_00ae: ldind.i4
    L_00af: ldc.i4.4
    L_00b0: add
    L_00b1: stind.i4
    L_00b2: br.s L_0038
    L_00b4: ret
    }

    Mine:

    .method public hidebysig instance void SatisfyConstraint() cil managed
    {
    .maxstack 2
    .locals init (
    [0] Kent.VerletSample.VerletConstraint constraint1,
    [1] [mscorlib]System.Collections.Generic.List`1/Enumerator enumerator1,
    [2] bool flag1)
    L_0000: nop
    L_0001: nop
    L_0002: ldarg.0
    L_0003: ldfld [mscorlib]System.Collections.Generic.List`1 Kent.VerletSample.VerletPoint::constraints
    L_0008: callvirt instance [mscorlib]System.Collections.Generic.List`1/Enumerator [mscorlib]System.Collections.Generic.List`1::GetEnumerator()
    L_000d: stloc.1
    L_000e: br.s L_0022
    L_0010: ldloca.s enumerator1
    L_0012: call instance !0 [mscorlib]System.Collections.Generic.List`1/Enumerator::get_Current()
    L_0017: stloc.0
    L_0018: nop
    L_0019: ldloc.0
    L_001a: ldarg.0
    L_001b: callvirt instance void Kent.VerletSample.VerletConstraint::Satisfy(Kent.VerletSample.VerletPoint)
    L_0020: nop
    L_0021: nop
    L_0022: ldloca.s enumerator1
    L_0024: call instance bool [mscorlib]System.Collections.Generic.List`1/Enumerator::MoveNext()
    L_0029: stloc.2
    L_002a: ldloc.2
    L_002b: brtrue.s L_0010
    L_002d: leave.s L_003e
    L_002f: ldloca.s enumerator1
    L_0031: constrained [mscorlib]System.Collections.Generic.List`1/Enumerator
    L_0037: callvirt instance void [mscorlib]System.IDisposable::Dispose()
    L_003c: nop
    L_003d: endfinally
    L_003e: nop
    L_003f: ret
    .try L_000e to L_002f finally handler L_002f to L_003e
    }

    Can you see my point now? Your code does not translate nicely into a managed environment because it does not utilize managed methodologies. Of course it’s going to perform horribly. That type of mess is all throughout the entire code base.

    Now, I don’t expect people to make the next HL2 in managed code; at this point in time that’s not a realistic goal. That extra 10% that managed code ACTUALLY costs you (which is itself misleading, because many managed methodologies are faster than their unmanaged counterparts) can sometimes make the difference. But in reality, that 10% isn’t that important for most games, especially with the current trend of pushing ridiculous graphics and little to no story/AI/playability.

    Comment by Washu — February 7, 2007 @ 10:04 pm

  8. Okay, so I just recompiled a messy C++ tech demo with /clr, and that’s not representative of what I would have got had I recoded the entire thing in a managed language. This greatly inflated the drop in framerate, and might put some people off looking at using managed code in games. Sorry about that.

    Are you re-writing the whole thing? I’d be interested to see what happens.

    Other people agree with you:
    http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=1160556&SiteID=1

    Jim Perry says:
    Just finished reading the article. Very cool that game development using managed code is getting some exposure. As for the author’s test, he simply recompiled native C++ code using the CLR switch. Probably not the best example, depending on how his code was setup. A better example would be to see the same project coded from the ground up using a managed language. Who knows what the compiler was doing just by setting that switch.

    So my example was off the mark. But the gist of the article still stands. Managed code is great for rapid development, and it’s more solid and maintainable. Some games can be written entirely in managed code, since performance is not always an issue. Game engine components suffer some degradation in performance with the current CLR, but you can divide your code into managed and unmanaged parts if this is an issue.

    Another relevant quote (and a very interesting page in general, with good links):
    http://blogs.msdn.com/netcfteam/archive/2006/12/22/managed-code-performance-on-xbox-360-for-the-xna-framework-1-0.aspx

    Expect to get less floating point throughput than you would from identical managed code running on the desktop CLR on a good off the shelf PC. How much less depends on your code so it’s hard to provide a meaningful ratio.

    I asked someone in the Microsoft Game Technology group if they agreed with the way I split the tasks and he said:

    I agree with the division. I would think that most developers would, although the exact positioning of the transition point along the continuum will continue to be a hot debate topic (and should gradually move towards the managed end of the spectrum).

    Side note: the Managed Code article is available online in the sample Game Developer:
    http://gamedeveloper.texterity.com/gamedeveloper/sample/?pg=41

    Comment by Mick West — February 8, 2007 @ 8:31 am

  9. And regarding that IL you posted above, a good part of that is the STL parameter checking, which is really inappropriate for optimized release-mode code in this type of application, and can be removed with:

    #define _SECURE_SCL 0

    The resultant IL then looks like:

    ldloca.s 2 ; _i$
    ldarg.0 ; _this$
    ldc.i4.s 32 ; i32 0x20
    add
    ldind.i4
    stind.i4
    ldarg.0 ; _this$
    ldc.i4.s 36 ; i32 0x24
    add
    stloc.1 ; $T98077
    $LL20@SatisfyCon:
    ldloca.s 2 ; _i$
    ldind.i4
    ldloc.1 ; $T98077
    ldind.i4
    ceq
    conv.u1
    ldc.i4.0 ; i32 0x0
    ceq
    conv.u1
    brfalse.s $LN1@SatisfyCon
    ldloca.s 2 ; _i$
    ldind.i4
    ldind.i4
    stloc.0
    ldloc.0
    ldarg.0 ; _this$
    ldloc.0
    ldind.i4
    ldind.i4
    calli D:V(II)
    ldloca.s 2 ; _i$
    ldloca.s 2 ; _i$
    ldind.i4
    ldc.i4.4 ; i32 0x4
    add
    stind.i4
    br.s $LL20@SatisfyCon
    $LN1@SatisfyCon:
    ret

    Which at first glance looks like it might be simpler than your version (longer, but no system calls).

    This made the /clr version go from 50fps to 60fps, and the unmanaged version from 150fps to 180fps. Woot. But still the same relative slowdown for managed vs. unmanaged.

    I think I’ll look into this a bit more and see what is going on.

    Comment by Mick West — February 8, 2007 @ 10:49 am

  10. Okay, I think the above shows that the iteration overhead is not as great as I might have thought. Now, much of my code is doing floating point vector calculations like this:


    Vector2 CSemiRigidConstraint::GetForce(CVerletPoint* p_verlet)
    {
        Vector2 to_me = p_verlet->GetPos() - mp_other_verlet->GetPos();
        if (to_me.Length() < 0.000001) { to_me = Vector2(1.0f,0.0f); }
        Vector2 mid = mp_other_verlet->GetPos() + to_me.Normal()*m_mid;
        Vector2 to_mid = mid - p_verlet->GetPos();
        return to_mid*m_force;
    }

    Which in my unmanaged version compiles as:

    Vector2 CSemiRigidConstraint::GetForce(CVerletPoint* p_verlet)
    {
    004062F0 push ebp
    004062F1 mov ebp,esp
    004062F3 and esp,0FFFFFFF8h
    Vector2 to_me = p_verlet->GetPos() - mp_other_verlet->GetPos();
    004062F6 mov edx,dword ptr [ebp+0Ch]
    004062F9 mov eax,dword ptr [ecx+4]
    004062FC movss xmm0,dword ptr [eax]
    00406300 movss xmm1,dword ptr [eax+4]
    00406305 movss xmm4,dword ptr [edx]
    00406309 movss xmm5,dword ptr [edx+4]
    if (to_me.Length() < 0.000001)
    0040630E xorps xmm6,xmm6
    00406311 movaps xmm2,xmm4
    00406314 subss xmm2,xmm0
    00406318 movaps xmm3,xmm5
    0040631B subss xmm3,xmm1
    0040631F movaps xmm0,xmm3
    00406322 mulss xmm0,xmm3
    00406326 movaps xmm1,xmm2
    00406329 mulss xmm1,xmm2
    0040632D addss xmm0,xmm1
    00406331 movsd xmm1,mmword ptr [__real@3eb0c6f7a0b5ed8d (418060h)]
    00406339 sqrtss xmm0,xmm0
    0040633D sub esp,8
    00406340 cvtps2pd xmm0,xmm0
    00406343 comisd xmm1,xmm0
    00406347 movss xmm1,dword ptr [__real@3f800000 (41803Ch)]
    0040634F jbe CSemiRigidConstraint::GetForce+67h (406357h)
    {
    to_me = Vector2(1.0f,0.0f);
    00406351 movaps xmm2,xmm1
    00406354 movaps xmm3,xmm6
    }
    Vector2 mid = mp_other_verlet->GetPos() + to_me.Normal()*m_mid;
    00406357 movaps xmm0,xmm3
    0040635A mulss xmm0,xmm3
    0040635E movaps xmm7,xmm2
    00406361 mulss xmm7,xmm2
    00406365 addss xmm0,xmm7
    00406369 sqrtss xmm0,xmm0
    0040636D comiss xmm0,xmm6
    00406370 jbe CSemiRigidConstraint::GetForce+93h (406383h)
    00406372 divss xmm1,xmm0
    00406376 movaps xmm0,xmm1
    00406379 mulss xmm0,xmm2
    0040637D mulss xmm1,xmm3
    00406381 jmp CSemiRigidConstraint::GetForce+9Eh (40638Eh)
    00406383 movss xmm1,dword ptr [esp+4]
    00406389 movss xmm0,dword ptr [esp]
    0040638E movss xmm2,dword ptr [ecx+0Ch]
    00406393 mulss xmm1,xmm2
    00406397 mulss xmm0,xmm2
    0040639B movss xmm2,dword ptr [eax+4]
    004063A0 movaps xmm3,xmm1
    004063A3 movss xmm1,dword ptr [eax]
    Vector2 to_mid = mid-p_verlet->GetPos() ;
    return to_mid*m_force;
    004063A7 mov eax,dword ptr [ebp+8]
    004063AA addss xmm1,xmm0
    004063AE addss xmm2,xmm3
    004063B2 movaps xmm0,xmm4
    004063B5 subss xmm1,xmm0
    004063B9 movss xmm0,dword ptr [ecx+14h]
    004063BE movaps xmm3,xmm5
    004063C1 subss xmm2,xmm3
    004063C5 mulss xmm1,xmm0
    004063C9 mulss xmm2,xmm0
    004063CD movss dword ptr [eax],xmm1
    004063D1 movss dword ptr [eax+4],xmm2
    }
    004063D6 mov esp,ebp
    004063D8 pop ebp
    004063D9 ret 8

    Now that’s using SIMD extensions, which we can’t use in the CLR, so let’s turn them off:

    Vector2 CSemiRigidConstraint::GetForce(CVerletPoint* p_verlet)
    {
    Vector2 to_me = p_verlet->GetPos() - mp_other_verlet->GetPos();
    00405860 mov edx,dword ptr [ecx+4]
    00405863 fld dword ptr [edx]
    00405865 sub esp,8
    00405868 fld dword ptr [edx+4]
    0040586B push esi
    0040586C mov esi,dword ptr [esp+14h]
    00405870 fld dword ptr [esi]
    00405872 fsubrp st(2),st
    00405874 fsubr dword ptr [esi+4]
    if (to_me.Length() < 0.000001)
    00405877 fld st(0)
    00405879 fmul st,st(1)
    0040587B fld st(2)
    0040587D fmul st,st(3)
    0040587F faddp st(1),st
    00405881 fsqrt
    00405883 fcomp qword ptr [__real@3eb0c6f7a0b5ed8d (414508h)]
    00405889 fnstsw ax
    0040588B fld1
    0040588D test ah,5
    00405890 fldz
    00405892 jp CSemiRigidConstraint::GetForce+46h (4058A6h)
    {
    to_me = Vector2(1.0f,0.0f);
    00405894 fstp st(3)
    00405896 fstp st(1)
    00405898 fld st(0)
    0040589A fld st(2)
    0040589C fxch st(1)
    0040589E fxch st(3)
    004058A0 fxch st(1)
    004058A2 fxch st(2)
    004058A4 fxch st(1)
    }
    Vector2 mid = mp_other_verlet->GetPos() + to_me.Normal()*m_mid;
    004058A6 fld st(2)
    004058A8 fmul st,st(3)
    004058AA fld st(4)
    004058AC fmul st,st(5)
    004058AE faddp st(1),st
    004058B0 fsqrt
    004058B2 fcom st(1)
    004058B4 fnstsw ax
    004058B6 fstp st(1)
    004058B8 test ah,41h
    004058BB jne CSemiRigidConstraint::GetForce+67h (4058C7h)
    004058BD fdivp st(1),st
    004058BF fld st(0)
    004058C1 fmulp st(3),st
    004058C3 fmulp st(1),st
    004058C5 jmp CSemiRigidConstraint::GetForce+77h (4058D7h)
    004058C7 fstp st(0)
    004058C9 fstp st(2)
    004058CB fstp st(0)
    004058CD fstp st(0)
    004058CF fld dword ptr [esp+4]
    004058D3 fld dword ptr [esp+8]
    004058D7 fld dword ptr [ecx+0Ch]
    Vector2 to_mid = mid-p_verlet->GetPos() ;
    return to_mid*m_force;
    004058DA mov eax,dword ptr [esp+10h]
    004058DE fmul st(2),st
    004058E0 fmulp st(1),st
    004058E2 fld dword ptr [edx]
    004058E4 faddp st(2),st
    004058E6 fadd dword ptr [edx+4]
    004058E9 fld dword ptr [esi]
    004058EB fld dword ptr [esi+4]
    004058EE pop esi
    004058EF fxch st(3)
    004058F1 fsubrp st(1),st
    004058F3 fxch st(1)
    004058F5 fsubrp st(2),st
    004058F7 fld dword ptr [ecx+14h]
    004058FA fmul st(1),st
    004058FC fxch st(1)
    004058FE fstp dword ptr [eax]
    00405900 fmulp st(1),st
    00405902 fstp dword ptr [eax+4]
    }
    00405905 add esp,8
    00405908 ret 8

    Similar size, just doing the floating point on the FP stack rather than in SIMD registers.

    Now look at the /clr version:

    Vector2 CSemiRigidConstraint::GetForce(CVerletPoint* p_verlet)
    {
    Vector2 to_me = p_verlet->GetPos() - mp_other_verlet->GetPos();
    00000000 push ebp
    00000001 mov ebp,esp
    00000003 push edi
    00000004 push esi
    00000005 push ebx
    00000006 sub esp,94h
    0000000c mov esi,ecx
    0000000e mov edi,edx
    00000010 cmp dword ptr ds:[006C2DC8h],0
    00000017 je 0000001E
    00000019 call 78DE2926
    0000001e fldz
    00000020 fstp dword ptr [ebp-10h]
    00000023 xor ebx,ebx
    00000025 fldz
    00000027 fstp dword ptr [ebp-14h]
    0000002a fldz
    0000002c fstp dword ptr [ebp-18h]
    0000002f fldz
    00000031 fstp dword ptr [ebp-1Ch]
    00000034 fldz
    00000036 fstp dword ptr [ebp-20h]
    00000039 fldz
    0000003b fstp dword ptr [ebp-24h]
    0000003e xor edx,edx
    00000040 mov dword ptr [ebp-28h],edx
    00000043 xor edx,edx
    00000045 mov dword ptr [ebp-2Ch],edx
    00000048 fldz
    0000004a fstp dword ptr [ebp-30h]
    0000004d fldz
    0000004f fstp dword ptr [ebp-34h]
    00000052 fldz
    00000054 fstp dword ptr [ebp-38h]
    00000057 fldz
    00000059 fstp dword ptr [ebp-3Ch]
    0000005c fldz
    0000005e fstp dword ptr [ebp-40h]
    00000061 fldz
    00000063 fstp dword ptr [ebp-44h]
    00000066 fldz
    00000068 fstp dword ptr [ebp-48h]
    0000006b fldz
    0000006d fstp dword ptr [ebp-4Ch]
    00000070 fldz
    00000072 fstp dword ptr [ebp-50h]
    00000075 fldz
    00000077 fstp dword ptr [ebp-54h]
    0000007a mov eax,dword ptr [esi+4]
    0000007d mov dword ptr [ebp-2Ch],eax
    00000080 mov eax,dword ptr [ebp-2Ch]
    00000083 mov dword ptr [ebp-28h],eax
    00000086 mov eax,dword ptr [ebp-28h]
    00000089 fld dword ptr [eax]
    0000008b fstp dword ptr [ebp+FFFFFF7Ch]
    00000091 mov eax,dword ptr [ebp-28h]
    00000094 fld dword ptr [eax+4]
    00000097 fstp dword ptr [ebp-80h]
    0000009a mov eax,dword ptr [ebp+8]
    0000009d fld dword ptr [eax]
    0000009f fstp dword ptr [ebp-7Ch]
    000000a2 mov eax,dword ptr [ebp+8]
    000000a5 fld dword ptr [eax+4]
    000000a8 fstp dword ptr [ebp-78h]
    000000ab fld dword ptr [ebp-7Ch]
    000000ae fsub dword ptr [ebp+FFFFFF7Ch]
    000000b4 fstp dword ptr [ebp-24h]
    000000b7 fld dword ptr [ebp-78h]
    000000ba fsub dword ptr [ebp-80h]
    000000bd fstp dword ptr [ebp-20h]
    000000c0 fld dword ptr [ebp-24h]
    000000c3 fstp dword ptr [ebp-74h]
    000000c6 fld dword ptr [ebp-20h]
    000000c9 fstp dword ptr [ebp-70h]
    if (to_me.Length() < 0.000001)
    000000cc fld dword ptr [ebp-20h]
    000000cf fmul st,st(0)
    000000d1 fld dword ptr [ebp-24h]
    000000d4 fmul st,st(0)
    000000d6 faddp st(1),st
    000000d8 fstp dword ptr [ebp-54h]
    000000db fld dword ptr [ebp-54h]
    000000de fsqrt
    000000e0 fstp qword ptr [ebp+FFFFFF74h]
    000000e6 fld qword ptr [ebp+FFFFFF74h]
    000000ec fstp dword ptr [ebp+FFFFFF68h]
    000000f2 fld dword ptr [ebp+FFFFFF68h]
    000000f8 fstp qword ptr [ebp+FFFFFF60h]
    000000fe fld qword ptr [ebp+FFFFFF60h]
    00000104 fld qword ptr ds:[012AFBD0h]
    0000010a fcomip st,st(1)
    0000010c fstp st(0)
    0000010e jp 0000011C
    00000110 jbe 0000011C
    {
    to_me = Vector2(1.0f,0.0f);
    00000112 fld1
    00000114 fstp dword ptr [ebp-74h]
    00000117 fldz
    00000119 fstp dword ptr [ebp-70h]
    }
    Vector2 mid = mp_other_verlet->GetPos() + to_me.Normal()*m_mid;
    0000011c fld dword ptr [ebp-70h]
    0000011f fmul st,st(0)
    00000121 fld dword ptr [ebp-74h]
    00000124 fmul st,st(0)
    00000126 faddp st(1),st
    00000128 fstp dword ptr [ebp-50h]
    0000012b fld dword ptr [ebp-50h]
    0000012e fsqrt
    00000130 fstp qword ptr [ebp+FFFFFF6Ch]
    00000136 fld qword ptr [ebp+FFFFFF6Ch]
    0000013c fstp dword ptr [ebp-1Ch]
    0000013f fld dword ptr [ebp-1Ch]
    00000142 fldz
    00000144 fcomip st,st(1)
    00000146 fstp st(0)
    00000148 jp 00000168
    0000014a jae 00000168
    0000014c fld dword ptr [ebp-1Ch]
    0000014f fld1
    00000151 fdivrp st(1),st
    00000153 fstp dword ptr [ebp-18h]
    00000156 fld dword ptr [ebp-18h]
    00000159 fmul dword ptr [ebp-74h]
    0000015c fstp dword ptr [ebp-6Ch]
    0000015f fld dword ptr [ebp-18h]
    00000162 fmul dword ptr [ebp-70h]
    00000165 fstp dword ptr [ebp-68h]
    00000168 fld dword ptr [esi+0Ch]
    0000016b fstp dword ptr [ebp-14h]
    0000016e fld dword ptr [ebp-6Ch]
    00000171 fmul dword ptr [ebp-14h]
    00000174 fstp dword ptr [ebp-4Ch]
    00000177 fld dword ptr [ebp-68h]
    0000017a fmul dword ptr [ebp-14h]
    0000017d fstp dword ptr [ebp-48h]
    00000180 mov ebx,dword ptr [ebp-2Ch]
    00000183 fld dword ptr [ebx]
    00000185 fstp dword ptr [ebp-64h]
    00000188 fld dword ptr [ebx+4]
    0000018b fstp dword ptr [ebp-60h]
    0000018e fld dword ptr [ebp-64h]
    00000191 fadd dword ptr [ebp-4Ch]
    00000194 fstp dword ptr [ebp-44h]
    00000197 fld dword ptr [ebp-60h]
    0000019a fadd dword ptr [ebp-48h]
    0000019d fstp dword ptr [ebp-40h]
    Vector2 to_mid = mid-p_verlet->GetPos() ;
    000001a0 mov eax,dword ptr [ebp+8]
    000001a3 fld dword ptr [eax]
    000001a5 fstp dword ptr [ebp-5Ch]
    000001a8 mov eax,dword ptr [ebp+8]
    000001ab fld dword ptr [eax+4]
    000001ae fstp dword ptr [ebp-58h]
    000001b1 fld dword ptr [ebp-44h]
    000001b4 fsub dword ptr [ebp-5Ch]
    000001b7 fstp dword ptr [ebp-3Ch]
    000001ba fld dword ptr [ebp-40h]
    000001bd fsub dword ptr [ebp-58h]
    000001c0 fstp dword ptr [ebp-38h]
    return to_mid*m_force;
    000001c3 fld dword ptr [esi+14h]
    000001c6 fstp dword ptr [ebp-10h]
    000001c9 fld dword ptr [ebp-3Ch]
    000001cc fmul dword ptr [ebp-10h]
    000001cf fstp dword ptr [ebp-34h]
    000001d2 fld dword ptr [ebp-38h]
    000001d5 fmul dword ptr [ebp-10h]
    000001d8 fstp dword ptr [ebp-30h]
    000001db fld dword ptr [ebp-34h]
    000001de fstp dword ptr [edi]
    000001e0 fld dword ptr [ebp-30h]
    000001e3 fstp dword ptr [edi+4]
    000001e6 mov eax,edi
    000001e8 lea esp,[ebp-0Ch]
    000001eb pop ebx
    000001ec pop esi
    000001ed pop edi
    000001ee pop ebp
    000001ef ret 4

    Even just glancing over it, it looks like it would take twice as long, possibly more depending on the memory cache situation. Is this just bad CLR compiling, or am I doing something wrong?

    Comment by Mick West — February 8, 2007 @ 11:27 am

  11. I do apologize for the spam. It would appear that there is a limit to the length of a reply I can send, so I’ve moved the comparisons off to a file on my site.
    It’s a bit of both, in reality. The CLR can’t currently take good advantage of SIMD instructions, so that’s obviously a ding against it. But the C++/CLI compiler is also not the brightest when doing an unmanaged -> managed conversion.

    The above contains a naïve implementation of your function, on the left. It is significantly shorter than the one above that the C++/CLI compiler generated, though still not that great. So what kinds of problems are there? Well, there is a lot of excessive copying: operators like - generate temporaries. On the right you can see the version written the way a managed programmer in a performance-sensitive environment would write it (without completely sacrificing readability).

    Mind you, I could get the right one down to about the same size as yours, without SIMD benefits, but it sacrifices readability.

    Comment by Washu — February 8, 2007 @ 1:57 pm

  12. I take it your code got chopped off at the <, maybe from not closing both your <code> tags?

    But I get your point, you can massage the code using what are essentially intrinsics, to get the JIT compiler to spit out code that equals or betters the performance of native compiled C++ code (at least in a straightforward situation like this).

    There are a number of calls in your code, which helps make it shorter. Of course, size is not so important here; it’s how fast it runs that matters.

    And it seems a little counterintuitive to have to hack away at code to massage the resultant assembly.

    Comment by Mick West — February 8, 2007 @ 2:13 pm

  13. And now I’ve seen your whole code (nice comparison, BTW). It’s still 10% longer than the non-SIMD version, plus it calls functions for Magnitude, adding, and subtracting, whereas the C++ version just calls sqrtf, etc. I don’t know what’s in those functions, but I highly suspect at least another ten lines of asm per function, possibly more.

    But how fast does it run?

    Comment by Mick West — February 8, 2007 @ 2:21 pm

  14. Actually, it’s not so much hacking to make the assembly nice. It’s a difference in how various types are treated.

    ValueTypes in .NET are a copy-based mechanism, much as they are in other languages. Anything marked with the struct keyword in C# is considered a value type. Making it a class will not solve the problem (actually, it makes it worse, but I’ll detail why in a bit). Since value types are primarily a copy mechanism (the old immutable idea), you’ve got to account for that in your code. Things like taking a reference to the value type (passing by-ref or as an out parameter) let the CLR see how you are attempting to use these objects, enabling it to eliminate redundant copies and the like. The JIT has very little time to run in, so it’s not always going to make the best decisions (it does make a lot of smart ones, though).

    Classes don’t solve the problem because they are heap-only. You cannot allocate a non-heap-based class, so allocating a class type will eventually lead to a GC. Now, Gen0 collections are INSANELY fast, but if you’re pumping out a lot of short-lived temporaries, some objects will live a bit longer and get promoted into Gen1. Then, when Gen1 gets full, it will also be collected; those short-lived temps will be released, but any that live just a bit longer could end up promoted into Gen2. Gen2 takes a long time to free up, especially with the LOH being up there in Gen2 as well. So obviously short-lived temporaries can cost you… if you’re not careful.

    But it’s not all horror stories: a Gen0 collection is INSANELY fast, as I mentioned above; a wee bit of profiling on my part found it to be faster than a C++ allocation on a moderately fragmented heap. Since most C++ allocators use a heap walk to find a free chunk of memory to allocate (this is unspecified in the standard, so not all implementations have to behave this way), that traversal can become quite expensive. With managed languages, however, an allocation is a constant-time operation. The GC will typically do a sweep and compact when a Gen2 or Gen1 collection happens, but in general Gen0 collections rarely involve such actions. Plus, the GC typically doesn’t have to freeze your application[1].
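The “constant time” claim above is because a nursery (Gen0) allocation is essentially a pointer bump into a contiguous region. A minimal C++ sketch of that idea, for illustration only (the real CLR allocator and GC are far more involved, and `BumpAllocator` is my name, not a real API):

```cpp
#include <cstddef>
#include <cstdint>

// Toy nursery-style allocator: allocation is just "round up, bump a
// pointer" -- no heap walk, no free list. This is the reason a Gen0
// allocation can beat a general-purpose C++ heap allocation.
class BumpAllocator {
    uint8_t* m_base;
    size_t   m_used;
    size_t   m_size;
public:
    BumpAllocator(uint8_t* base, size_t size)
        : m_base(base), m_used(0), m_size(size) {}

    void* Alloc(size_t bytes) {
        bytes = (bytes + 7) & ~size_t(7);       // 8-byte align the request
        if (m_used + bytes > m_size)
            return nullptr;                     // nursery full: a real GC would collect here
        void* p = m_base + m_used;
        m_used += bytes;                        // the entire "search" for free memory
        return p;
    }
};
```

Freeing individual objects is deliberately impossible here; a generational GC reclaims the whole nursery at once by evacuating the survivors, which is why short-lived temporaries are cheap as long as they die young.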

    You have to watch out for finalizers, though: since destruction is not a deterministic action, finalizers can cause short-lived objects to be pushed into the Gen1 heap when they are really just waiting to be finalized.[2]

    [1] Using GC Efficiently – Part 1
    Using GC Efficiently – Part 2
    Using GC Efficiently – Part 3

    [2] Clearing up some confusion over finalization and other areas in GC

    Comment by Washu — February 8, 2007 @ 2:27 pm

  15. And now I’ve seen your whole code (nice comparison, BTW). It’s still 10% longer than the non-SIMD version, plus it calls functions for Magnitude, adding, and subtracting, whereas the C++ version just calls sqrtf, etc. I don’t know what’s in those functions, but I highly suspect at least another ten lines of asm per function, possibly more.

    Indeed, and there are ways to get around that too, ways that will generate better/faster code, ways that a professional application would use, such as ngen. Using that to pre-JIT the program on the target machine during installation allows it more time to optimize. How much better it will be, I can’t be sure.

    But how fast does it run?

    Faster. I don’t have the entire sample done, and probably am not going to complete it either (I have other commitments), but profiling suggests that it is about 5-10% slower than yours (without SIMD).

    Which isn’t bad. It’s not great, but overall that’s a significant performance boost over just compiling with /clr. You still won’t be able to write a HL2 killer in it, at least not yet. What is interesting are the future JIT compilers being invested in by Microsoft: JIT compilers that can optimize based on the machine configuration, something that current statically built applications cannot do. As an example, if you want to use SIMD, you typically have to build in several code paths: ones that can use SIMD, and ones that can’t. You then decide (at runtime) which path to take depending on the available features. The disadvantage here is a level of indirection, and a whole hell of a lot of code on the developer’s part. A well-written inner product can easily outperform the SSE generated by VSTS (Visual Studio Team System). Compilers just aren’t good at vectorization, even ones like the Intel compiler (which produces bugs when used, as translation of code to a vectorized format inherently changes the behavior of the application in unpredictable ways).

    A more advanced JIT will be able to target the processor the machine is running on, including hardware extensions, when the application is launched. This presents an opportunity for extreme runtime optimization based on extended instruction sets. The JIT will still be constrained to a shorter running time than a static compiler, but with tricks like NGEN you will be able to optimize to a great extent.

    Comment by Washu — February 8, 2007 @ 2:41 pm

  16. I wrote an “old fashioned” version for kicks:


    Vector2 CSemiRigidConstraint::GetForce(CVerletPoint* p_verlet)
    {
        float v_x = p_verlet->GetPos().x;
        float v_y = p_verlet->GetPos().y;
        float o_x = mp_other_verlet->GetPos().x;
        float o_y = mp_other_verlet->GetPos().y;
        float t_x = (v_x - o_x);
        float t_y = (v_y - o_y);
        float t_len = sqrtf(t_x*t_x + t_y*t_y);
        if (t_len < 0.000001) // should be f, but leave it for comparison
        {
            t_x = 1.0f;
            t_y = 0.0f;
            t_len = 1.0f;
        }
        float mid = m_mid;
        float m_x = o_x + t_x/t_len*mid;
        float m_y = o_y + t_y/t_len*mid;
        float tm_x = m_x - v_x;
        float tm_y = m_y - v_y;
        float force = m_force;
        return Vector2(tm_x*force, tm_y*force);
    }

    And here's the comparison.

    https://cowboyprogramming.com/code/ComparisonOfManagedOptimizations2.html

    And really I should be using the MD part of SIMD, which should be able to halve it again.

    So my original example was somewhat misleading. However, when you look at actual engine code, it WILL be optimized to take advantage of the processor architecture, and based on that I suspect the speed comparisons will be similar to what I was getting.

    Comment by Mick West — February 8, 2007 @ 3:03 pm

  17. And here’s the comparison.

    https://cowboyprogramming.com/code/ComparisonOfManagedOptimizations2.html

    And really I should be using the MD part of SIMD, which should be able to halve it again.

    So my original example was somewhat misleading. However, when you look at actual engine code, it WILL be optimized to take advantage of the processor architecture, and based on that I suspect the speed comparisons will be similar to what I was getting.

    Yes, I suppose if I really wanted to I could also sacrifice a huge amount of readability to hyperoptimize the managed version. Experience tells me that there are better ways to spend my time though. The 90/10 rule applies even here :)

    What I do note is the lack of ‘good’ SIMD in most applications; for instance, the lack of parallelized operations. Instead of performing one dot product at a time using SIMD, do four, as below (note that I do not claim that you should use the code below, just that it’s typically a better parallelization than single operations):

    ; Given that XMM0 - XMM7 contain R4 vectors V0 - V7, calculate the four
    ; inner products V0·V1, V2·V3, V4·V5 and V6·V7, and return all four
    ; results in a single vector.
    mulps xmm0, xmm1
    mulps xmm2, xmm3
    haddps xmm0, xmm2

    mulps xmm4, xmm5
    mulps xmm6, xmm7
    haddps xmm4, xmm6

    haddps xmm0, xmm4
    movaps result, xmm0

    Comment by Washu — February 8, 2007 @ 3:21 pm

  18. Yes, I suppose if I really wanted to I could also sacrifice a huge amount of readability to hyperoptimize the managed version. Experience tells me that there are better ways to spend my time though. The 90/10 rule applies even here :)

    Better ways to spend 90% of your time, that’s for sure. But I just made my application 8% faster with 15 minutes’ work (it went from 123 to 134 fps, in non-SIMD, just with that one change!). Remember what Knuth said about similar optimization:

    The conventional wisdom shared by many of today’s software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by penny-wise-and-pound-foolish programmers, who can’t debug or maintain their ‘optimized’ programs.

    Your point on “good” SIMD is well taken. I’m from console-land, and if you are doing anything intensive with vectors like this, then you want to be using SIMD (or preferably the vector units).

    It’s a period of transition. Managed code is great, but there’s still an appropriate division between game code and engine code. It’s not a hard and fast division, and depends on the game (and several other factors), but it’s still there. As managed code gets better at targeting hardware, then it will get used more.

    Comment by Mick West — February 8, 2007 @ 4:07 pm

  19. Simulating 2D “blobs”…

    On Mick West’s site there is an interesting tutorial on how to program blobs (or “viscous drops”, for lack of a better term) like the ones that appear in games such as Roco Loco….

    Trackback by pixelalo.com — September 10, 2007 @ 7:57 am

  20. Another awesome article, well done. I think this might be good for car tires too (probably with stiffer springs). They could simulate nice bending and bulging, and you could have a really simple traction policy too: traction = GetNumberOfPointsContactingRoad() / nNormalNumberPointsContactingRoad. Keep up the good work.
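The traction formula in this comment could be sketched like this (all the names here are hypothetical, just following the comment’s formula; they are not from the article’s source code):

```cpp
// Hypothetical sketch of the commenter's traction policy: traction scales
// with the fraction of tire points currently touching the road.
struct Tire {
    int contact_count;          // mass points touching the road this frame
    int normal_contact_count;   // expected contact count for a tire at rest
};

float GetTraction(const Tire& tire)
{
    if (tire.normal_contact_count == 0)
        return 0.0f;                         // avoid divide by zero
    float t = float(tire.contact_count) / float(tire.normal_contact_count);
    return t > 1.0f ? 1.0f : t;              // clamp: bulging under load can exceed the rest count
}
```

The clamp is one design choice; you could instead let traction exceed 1.0 when the tire squashes and presents a larger contact patch, which is arguably more physical.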

    Comment by Chris — February 22, 2009 @ 1:02 am

  21. Yeah, it probably would work for car tires. However, it would also be rather a waste of computing power (and programming time), as the effects would be too slight to see unless you had some kind of “tire cam”. :)

    Comment by Mick West — February 22, 2009 @ 7:35 am

  22. Just stumbled on this site, nice articles. I programmed similar physics for a game called Soul Bubbles on the DS. Lots of moving blobs and geometry collisions (a typical level has 15k points); fun to fit onto the DS.

    You said:
    “This causes the spring system to have a net force imbalance in a particular direction (depending on the order of update). The solution here was very simple. I just split the loop up into two separate loops: one to gather the forces, and then one to apply them in the integration step. This ensured that all forces were symmetrical.”

    I generally solved this kind of perturbation by inverting the processing order on each update loop (e.g. frame n updates points 0 to max, frame n+1 updates points max to 0). It is just as cache-friendly, but avoids iterating over the points more than once for this kind of operation, and it generally tends to stabilize the system.
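That order-flipping trick might be sketched like this (the `Relax` step here is an illustrative stand-in — a simple nudge toward the neighbour midpoint on a closed ring — not the actual Soul Bubbles constraint code):

```cpp
#include <vector>
#include <cstddef>

struct Point { float x, y; };

// Stand-in constraint pass: move point i halfway toward the midpoint of
// its ring neighbours. Sequential (in-place) updates like this are what
// make the result depend on traversal order.
static void Relax(std::vector<Point>& p, size_t i)
{
    size_t n = p.size();
    const Point& a = p[(i + n - 1) % n];
    const Point& b = p[(i + 1) % n];
    p[i].x += 0.5f * ((a.x + b.x) * 0.5f - p[i].x);
    p[i].y += 0.5f * ((a.y + b.y) * 0.5f - p[i].y);
}

// Flip the traversal direction each frame so the order-dependent bias
// cancels out over two frames instead of accumulating in one direction.
void UpdatePoints(std::vector<Point>& points, unsigned frame)
{
    if ((frame & 1) == 0) {
        for (size_t i = 0; i < points.size(); ++i)
            Relax(points, i);                 // frame n:   0 .. max
    } else {
        for (size_t i = points.size(); i-- > 0; )
            Relax(points, i);                 // frame n+1: max .. 0
    }
}
```

Compared with the article’s gather-then-apply split, this keeps a single pass over the points per update, at the cost of the forces never being exactly symmetrical within any one frame.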

    The bubbles in Soul Bubbles are 32 verlet-integrated points to which are applied:
    – A single layer of spring constraints (point N to point (N+1)%32).
    – Angle constraints to try to maintain an overall round shape, applied by moving point N to tend to satisfy an ideal angle between N-1, N, N+1. I don’t recall the details, but this proved to be different from applying springs between points N-1 and N+1.
    – A slight scaling up of all points from their average center to cope with external pressure when the surface became too small.
    – In cases where bubbles get into thin/sharp objects, collision responses are biased to push points back toward the blob center, and friction is increased for those points to reduce penetration.

    Add in some thresholds to optimize idle cases and various common-sense low-level optimizations. In the end, other components (such as managing the membranes between two colliding bubbles) were heavier.

    Comment by Omar — October 9, 2009 @ 5:09 am

  23. […] an excellent article by Mick West all about springs and blobby […]

    Pingback by Fun With Springs… | FML — July 20, 2011 @ 6:22 am
