<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Blob Physics</title>
	<atom:link href="http://cowboyprogramming.com/2007/01/05/blob-physics/feed/" rel="self" type="application/rss+xml" />
	<link>http://cowboyprogramming.com/2007/01/05/blob-physics/</link>
	<description>Game Development and General Hacking by the Old West</description>
	<pubDate>Fri, 25 Jul 2008 18:25:25 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
		<item>
		<title>By: pixelalo.com</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/#comment-3668</link>
		<dc:creator>pixelalo.com</dc:creator>
		<pubDate>Mon, 10 Sep 2007 15:57:35 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-3668</guid>
		<description>&lt;strong&gt;Simulating 2D &#34;blobs&#34;...&lt;/strong&gt;

Mick West's website has an interesting tutorial on how to program blobs (or &#34;viscous globs&#34;, for lack of a better synonym) like the ones that appear in games such as Roco Loco....</description>
		<content:encoded><![CDATA[<p><strong>Simulating 2D &quot;blobs&quot;&#8230;</strong></p>
<p>Mick West&#8217;s website has an interesting tutorial on how to program blobs (or &quot;viscous globs&quot;, for lack of a better synonym) like the ones that appear in games such as Roco Loco&#8230;.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mick West</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/#comment-44</link>
		<dc:creator>Mick West</dc:creator>
		<pubDate>Fri, 09 Feb 2007 00:07:44 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-44</guid>
		<description>&lt;blockquote&gt;Yes, I suppose if I really wanted to I could also sacrifice a huge amount of readability to hyperoptimize the managed version. Experience tells me that there are better ways to spend my time though. The 90/10 rule applies even here :)&lt;/blockquote&gt;

Better ways to spend 90% of your time, that's for sure.  But I just made my application 8% faster with 15 minutes' work (it went from 123 to 134 fps, in non-SIMD, just with that one change!).  Remember what Knuth said about similar optimization:

&lt;blockquote&gt;The conventional wisdom shared by many of today’s software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by penny-wise-and-pound-foolish programmers, who can’t debug or maintain their ‘optimized’ programs&lt;/blockquote&gt;


Your point on "good" SIMD is well taken.  I'm from console-land, and if you are doing anything intensive with vectors like this, then you want to be using SIMD (or preferably the vector units).

It's a period of transition.  Managed code is great, but there's still an appropriate division between game code and engine code.   It's not a hard and fast division, and it depends on the game (and several other factors), but it's still there.   As managed code gets better at targeting hardware, it will get used more.</description>
		<content:encoded><![CDATA[<blockquote><p>Yes, I suppose if I really wanted to I could also sacrifice a huge amount of readability to hyperoptimize the managed version. Experience tells me that there are better ways to spend my time though. The 90/10 rule applies even here :)</p></blockquote>
<p>Better ways to spend 90% of your time, that&#8217;s for sure.  But I just made my application 8% faster with 15 minutes&#8217; work (it went from 123 to 134 fps, in non-SIMD, just with that one change!).  Remember what Knuth said about similar optimization:</p>
<blockquote><p>The conventional wisdom shared by many of today’s software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by penny-wise-and-pound-foolish programmers, who can’t debug or maintain their ‘optimized’ programs</p></blockquote>
<p>Your point on &#8220;good&#8221; SIMD is well taken.  I&#8217;m from console-land, and if you are doing anything intensive with vectors like this, then you want to be using SIMD (or preferably the vector units).</p>
<p>It&#8217;s a period of transition.  Managed code is great, but there&#8217;s still an appropriate division between game code and engine code.   It&#8217;s not a hard and fast division, and it depends on the game (and several other factors), but it&#8217;s still there.   As managed code gets better at targeting hardware, it will get used more.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Washu</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/#comment-43</link>
		<dc:creator>Washu</dc:creator>
		<pubDate>Thu, 08 Feb 2007 23:21:39 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-43</guid>
		<description>&lt;blockquote&gt;
And here’s the comparison.

http://cowboyprogramming.com/code/ComparisonOfManagedOptimizations2.html

And really I should be using the MD part of SIMD, which should be able to halve it again.

So my original example was somewhat misleading. However, when you look at actual engine code, it WILL be optimized to take advantage of the processor architecture, and I suspect based on that then the speed comparisons will be similar to what I was getting. 
&lt;/blockquote&gt;
Yes, I suppose if I really wanted to I could also sacrifice a huge amount of readability to hyperoptimize the managed version. Experience tells me that there are better ways to spend my time though. The 90/10 rule applies even here :)

What I do note is the lack of 'good' SIMD in most applications, for instance the failure to parallelize operations. Instead of performing one dot product at a time using SIMD, do four, as in the code below (note that I do not claim that you should use this code, just that it's typically a better parallelization than single operations)
&lt;code&gt;
; Given that XMM0 - XMM7 contain R4 vectors V0 - V7, such that we wish to
; calculate the inner products of (V0, V1), (V2, V3), (V4, V5) and (V6, V7), and return the results of
; all four inner products in a result vector.
mulps xmm0, xmm1
mulps xmm2, xmm3
haddps xmm0, xmm2

mulps xmm4, xmm5
mulps xmm6, xmm7
haddps xmm4, xmm6

haddps xmm0, xmm4
movaps result, xmm0
&lt;/code&gt;</description>
		<content:encoded><![CDATA[<blockquote><p>
And here’s the comparison.</p>
<p><a href="http://cowboyprogramming.com/code/ComparisonOfManagedOptimizations2.html" rel="nofollow">http://cowboyprogramming.com/code/ComparisonOfManagedOptimizations2.html</a></p>
<p>And really I should be using the MD part of SIMD, which should be able to halve it again.</p>
<p>So my original example was somewhat misleading. However, when you look at actual engine code, it WILL be optimized to take advantage of the processor architecture, and I suspect based on that then the speed comparisons will be similar to what I was getting.
</p></blockquote>
<p>Yes, I suppose if I really wanted to I could also sacrifice a huge amount of readability to hyperoptimize the managed version. Experience tells me that there are better ways to spend my time though. The 90/10 rule applies even here :)</p>
<p>What I do note is the lack of &#8216;good&#8217; SIMD in most applications, for instance the failure to parallelize operations. Instead of performing one dot product at a time using SIMD, do four, as in the code below (note that I do not claim that you should use this code, just that it&#8217;s typically a better parallelization than single operations)<br />
<code><br />
; Given that XMM0 - XMM7 contain R4 vectors V0 - V7, such that we wish to<br />
; calculate the inner products of (V0, V1), (V2, V3), (V4, V5) and (V6, V7), and return the results of<br />
; all four inner products in a result vector.<br />
mulps xmm0, xmm1<br />
mulps xmm2, xmm3<br />
haddps xmm0, xmm2</p>
<p>mulps xmm4, xmm5<br />
mulps xmm6, xmm7<br />
haddps xmm4, xmm6</p>
<p>haddps xmm0, xmm4<br />
movaps result, xmm0<br />
</code></p>
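<p>For readers more comfortable with intrinsics than assembly, here is a rough C++ sketch of the same four-at-a-time idea. The <code>Vec4</code> type and the <code>dot4</code> name are made up for illustration and are not from the comment above; the SSE3 path is only compiled when the compiler defines <code>__SSE3__</code> (for example with <code>-msse3</code>), otherwise a scalar fallback with the same summation order is used.</p>

```cpp
#if defined(__SSE3__)
#include <pmmintrin.h>  // SSE3 intrinsics, including _mm_hadd_ps
#endif

// Hypothetical 4-float vector type, standing in for the contents of one XMM register.
struct Vec4 { float v[4]; };

// Computes out[i] = dot(a[i], b[i]) for i = 0..3.
void dot4(const Vec4 a[4], const Vec4 b[4], float out[4]) {
#if defined(__SSE3__)
    // Four element-wise products, one per vector pair (the mulps steps).
    __m128 p0 = _mm_mul_ps(_mm_loadu_ps(a[0].v), _mm_loadu_ps(b[0].v));
    __m128 p1 = _mm_mul_ps(_mm_loadu_ps(a[1].v), _mm_loadu_ps(b[1].v));
    __m128 p2 = _mm_mul_ps(_mm_loadu_ps(a[2].v), _mm_loadu_ps(b[2].v));
    __m128 p3 = _mm_mul_ps(_mm_loadu_ps(a[3].v), _mm_loadu_ps(b[3].v));
    // First round of horizontal adds leaves two partial sums per pair.
    __m128 h01 = _mm_hadd_ps(p0, p1);
    __m128 h23 = _mm_hadd_ps(p2, p3);
    // Second round collapses them: lane i now holds dot(a[i], b[i]).
    _mm_storeu_ps(out, _mm_hadd_ps(h01, h23));
#else
    // Scalar fallback for builds without SSE3, pairing the terms the same way.
    for (int i = 0; i < 4; ++i) {
        out[i] = (a[i].v[0] * b[i].v[0] + a[i].v[1] * b[i].v[1]) +
                 (a[i].v[2] * b[i].v[2] + a[i].v[3] * b[i].v[3]);
    }
#endif
}
```

<p>The point of the layout is that every instruction does useful work in all four lanes, rather than computing one dot product and discarding three lanes of the final horizontal add.</p>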
]]></content:encoded>
	</item>
	<item>
		<title>By: Mick West</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/#comment-42</link>
		<dc:creator>Mick West</dc:creator>
		<pubDate>Thu, 08 Feb 2007 23:03:46 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-42</guid>
		<description>I wrote an "old fashioned" version for kicks:

&lt;code&gt;
Vector2	CSemiRigidConstraint::GetForce(CVerletPoint* p_verlet)
{	float v_x = p_verlet-&gt;GetPos().x;
	float v_y = p_verlet-&gt;GetPos().y;
	float o_x = mp_other_verlet-&gt;GetPos().x;
	float o_y = mp_other_verlet-&gt;GetPos().y;
	float t_x = (v_x - o_x);
	float t_y = (v_y - o_y);
	float t_len = sqrtf(t_x*t_x + t_y*t_y);
	if (t_len &lt;0.000001) // should be f, but leave it for comparison
	{
		t_x = 1.0f;
		t_y = 0.0f;
		t_len = 1.0f;
	}
	float mid = m_mid;
	float m_x = o_x + t_x/t_len*mid;
	float m_y = o_y + t_y/t_len*mid;
	float tm_x = m_x - v_x;
	float tm_y = m_y - v_y;
	float force = m_force;
	return Vector2(tm_x*force,tm_y*force);
}
&lt;/code&gt;

And here's the comparison.

http://cowboyprogramming.com/code/ComparisonOfManagedOptimizations2.html

And really I should be using the MD part of SIMD, which should be able to halve it again.

So my original example was somewhat misleading.  However, when you look at actual engine code, it WILL be optimized to take advantage of the processor architecture, and I suspect based on that then the speed comparisons will be similar to what I was getting.</description>
		<content:encoded><![CDATA[<p>I wrote an &#8220;old fashioned&#8221; version for kicks:</p>
<p><code><br />
Vector2	CSemiRigidConstraint::GetForce(CVerletPoint* p_verlet)<br />
{	float v_x = p_verlet->GetPos().x;<br />
	float v_y = p_verlet->GetPos().y;<br />
	float o_x = mp_other_verlet->GetPos().x;<br />
	float o_y = mp_other_verlet->GetPos().y;<br />
	float t_x = (v_x - o_x);<br />
	float t_y = (v_y - o_y);<br />
	float t_len = sqrtf(t_x*t_x + t_y*t_y);<br />
	if (t_len &lt;0.000001) // should be f, but leave it for comparison<br />
	{<br />
		t_x = 1.0f;<br />
		t_y = 0.0f;<br />
		t_len = 1.0f;<br />
	}<br />
	float mid = m_mid;<br />
	float m_x = o_x + t_x/t_len*mid;<br />
	float m_y = o_y + t_y/t_len*mid;<br />
	float tm_x = m_x - v_x;<br />
	float tm_y = m_y - v_y;<br />
	float force = m_force;<br />
	return Vector2(tm_x*force,tm_y*force);<br />
}<br />
</code></p>
<p>And here&#8217;s the comparison.</p>
<p><a href="http://cowboyprogramming.com/code/ComparisonOfManagedOptimizations2.html" rel="nofollow">http://cowboyprogramming.com/code/ComparisonOfManagedOptimizations2.html</a></p>
<p>And really I should be using the MD part of SIMD, which should be able to halve it again.</p>
<p>So my original example was somewhat misleading.  However, when you look at actual engine code, it WILL be optimized to take advantage of the processor architecture, and I suspect based on that then the speed comparisons will be similar to what I was getting.</p>
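<p>For anyone who wants to experiment with the scalar version outside the original classes, a self-contained sketch might look like the following. <code>Vec2</code> and <code>SemiRigidForce</code> are placeholder names, and the member variables <code>m_mid</code> and <code>m_force</code> are passed in as plain parameters; this is an approximation for experimentation, not the code from the actual project.</p>

```cpp
#include <cmath>

// Hypothetical stand-in for the Vector2 type used in the snippet above.
struct Vec2 { float x, y; };

// Free-function version of the scalar GetForce: pulls point p toward the
// position that lies `mid` units away from `other`, along the line from
// `other` to p, and scales the displacement by `force`.
Vec2 SemiRigidForce(Vec2 p, Vec2 other, float mid, float force) {
    float t_x = p.x - other.x;
    float t_y = p.y - other.y;
    float t_len = std::sqrt(t_x * t_x + t_y * t_y);
    if (t_len < 0.000001f) {
        // Points coincide: pick an arbitrary unit direction to avoid dividing by zero.
        t_x = 1.0f;
        t_y = 0.0f;
        t_len = 1.0f;
    }
    // Target position at distance `mid` from `other`.
    float m_x = other.x + t_x / t_len * mid;
    float m_y = other.y + t_y / t_len * mid;
    // Displacement from p to the target, scaled by the force factor.
    return Vec2{(m_x - p.x) * force, (m_y - p.y) * force};
}
```

<p>For example, with p at (3, 0), the other point at the origin, mid = 1 and force = 2, the target is (1, 0) and the returned force is (-4, 0).</p>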
]]></content:encoded>
	</item>
	<item>
		<title>By: Washu</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/#comment-40</link>
		<dc:creator>Washu</dc:creator>
		<pubDate>Thu, 08 Feb 2007 22:41:20 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-40</guid>
		<description>&lt;blockquote&gt;
And now I’ve seen your whole code (nice comparison BTW), it’s still 10% longer than the non-SIMD version, plus it calls functions for Magnitude, adding and subtracting. Whereas the C++ version just calls sqrtf, ect. I don’t know what’s in those functions, but I highly suspect at least another ten lines of asm per function, possibly more.&lt;/blockquote&gt;
Indeed, there are ways around that too, ways that generate better, faster code and that a professional application would use, such as ngen. Using it to pre-JIT the program on the target machine during installation gives the compiler more time to optimize. How much better the result will be, I can't be sure.

&lt;blockquote&gt;But how fast does it run?&lt;/blockquote&gt;
Faster. I don't have the entire sample done, and probably am not going to complete it either (I have other commitments), but profiling suggests that it is about 5-10% slower than yours (without SIMD).

Which isn't bad. It is not great, but overall that’s a significant performance boost over compiling with just /clr. You still won’t be able to write an HL2 killer in it, at least not yet. What is interesting are the future JIT compilers that Microsoft is investing in: JIT compilers that can optimize based on the machine configuration, something that current statically built applications cannot do. As an example, if you want to use SIMD, you typically have to build in several code paths: ones that can use SIMD, and ones that can’t. You then decide (at runtime) which path to take depending on the available features. The disadvantage here is a single level of indirection, and a whole hell of a lot of code on the developer's part. A well-written inner product can easily outperform the SSE generated by VSTS (Visual Studio Team System). Compilers just aren’t good at vectorization, even ones like the Intel compiler (which produces bugs when used, as translating code to a vectorized format inherently changes the behavior of the application in unpredictable ways).

A more advanced JIT will be able to target the processor that the machine is running on, including hardware extensions, when the application is launched. This presents an opportunity for extreme runtime optimizations based on extended instruction sets. The JIT will still be constrained to a shorter running time than a static compiler, but using tricks like NGEN, you will be able to really optimize it to a great extent.</description>
		<content:encoded><![CDATA[<blockquote><p>
And now I’ve seen your whole code (nice comparison BTW), it’s still 10% longer than the non-SIMD version, plus it calls functions for Magnitude, adding and subtracting. Whereas the C++ version just calls sqrtf, ect. I don’t know what’s in those functions, but I highly suspect at least another ten lines of asm per function, possibly more.</p></blockquote>
<p>Indeed, there are ways around that too, ways that generate better, faster code and that a professional application would use, such as ngen. Using it to pre-JIT the program on the target machine during installation gives the compiler more time to optimize. How much better the result will be, I can&#8217;t be sure.</p>
<blockquote><p>But how fast does it run?</p></blockquote>
<p>Faster. I don&#8217;t have the entire sample done, and probably am not going to complete it either (I have other commitments), but profiling suggests that it is about 5-10% slower than yours (without SIMD).</p>
<p>Which isn&#8217;t bad. It is not great, but overall that’s a significant performance boost over compiling with just /clr. You still won’t be able to write an HL2 killer in it, at least not yet. What is interesting are the future JIT compilers that Microsoft is investing in: JIT compilers that can optimize based on the machine configuration, something that current statically built applications cannot do. As an example, if you want to use SIMD, you typically have to build in several code paths: ones that can use SIMD, and ones that can’t. You then decide (at runtime) which path to take depending on the available features. The disadvantage here is a single level of indirection, and a whole hell of a lot of code on the developer&#8217;s part. A well-written inner product can easily outperform the SSE generated by VSTS (Visual Studio Team System). Compilers just aren’t good at vectorization, even ones like the Intel compiler (which produces bugs when used, as translating code to a vectorized format inherently changes the behavior of the application in unpredictable ways).</p>
<p>A more advanced JIT will be able to target the processor that the machine is running on, including hardware extensions, when the application is launched. This presents an opportunity for extreme runtime optimizations based on extended instruction sets. The JIT will still be constrained to a shorter running time than a static compiler, but using tricks like NGEN, you will be able to really optimize it to a great extent.</p>
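<p>To make the "several code paths plus one level of indirection" pattern concrete, here is a minimal C++ sketch of how a statically compiled program might do it by hand. All the function names are invented for this example, and the feature check relies on the GCC/Clang builtin <code>__builtin_cpu_supports</code>, so it is an assumption that you are on one of those compilers targeting x86; elsewhere the sketch simply falls back to the scalar path.</p>

```cpp
#include <cstddef>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <xmmintrin.h>  // SSE intrinsics
#define HAVE_SSE_PATH 1
#endif

// Plain scalar inner product: the fallback path that runs on any CPU.
static float dot_scalar(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

#if defined(HAVE_SSE_PATH)
// SSE path: four multiply-adds per iteration, then a scalar tail.
// Note the summation order differs from dot_scalar, so results can differ
// in the last bits for inputs that are not exactly representable.
__attribute__((target("sse")))
static float dot_sse(const float* a, const float* b, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
#endif

using DotFn = float (*)(const float*, const float*, std::size_t);

// Picks an implementation based on what the running CPU reports.
static DotFn select_dot() {
#if defined(HAVE_SSE_PATH)
    if (__builtin_cpu_supports("sse")) return dot_sse;
#endif
    return dot_scalar;
}

// Public entry point: the choice is made once, on the first call, and every
// later call pays only the function-pointer indirection.
float dot(const float* a, const float* b, std::size_t n) {
    static const DotFn impl = select_dot();
    return impl(a, b, n);
}
```

<p>Every additional instruction set means another hand-written variant and another branch in the selector, which is the developer cost described above; a JIT that targets the actual processor would in principle make that choice for you.</p>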
]]></content:encoded>
	</item>
	<item>
		<title>By: Washu</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/#comment-39</link>
		<dc:creator>Washu</dc:creator>
		<pubDate>Thu, 08 Feb 2007 22:27:16 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-39</guid>
		<description>Actually, it's not so much a matter of hacking the code to make the assembly nice. It's a difference in how various types are treated.

Value types in .NET are a copy-based mechanism, much like they are in other languages. Anything declared as a struct in C# is a value type. Making it a class will not solve the problem though (actually, it makes it worse, but I'll detail why in a bit). Since value types are primarily a copy mechanism (the old immutable idea), you've got to account for that in your code. Things like taking a reference to the value type (passing it by ref or as an out parameter) let the CLR see how you are attempting to use these objects, enabling it to eliminate redundant copies and the like. The JIT has very little time to run in, so it's not always going to make the best decisions (it does make a lot of smart ones though).

Classes don't solve the problem because they are heap only. You cannot allocate a non-heap-based class, so allocating a class type will eventually lead to a GC. Now, Gen0 collections are INSANELY fast, but if you're pumping out a lot of short-lived temporaries, some objects will live a bit longer and get pushed into Gen1. Then when Gen1 gets full, it will also be collected, and those short-lived temps will be released, but any that live just a bit longer could end up being pushed up into Gen2. Gen2 takes a long time to free up, especially with the LOH being up there in Gen2 as well. So obviously short-lived temporaries can cost you...if you're not careful.

But it's not all horror stories: a Gen0 collection is INSANELY fast, as I mentioned above. A wee bit of profiling on my part found it to be faster than a C++ allocation on a moderately fragmented heap. Since most C++ allocators use a heap walk to find a free chunk of memory to allocate (this is unspecified in the standard, so not all implementations have to behave this way), that traversal can become quite expensive. With managed languages, however, an allocation is a constant-time operation. The GC will typically do a sweep and compact when a Gen2 or Gen1 collection happens, but in general, Gen0 collections rarely have such actions. Plus, the GC typically doesn't have to freeze your application[1].

You have to watch out for finalizers though. Since destruction is not a deterministic action, finalizers can cause short-lived elements to be pushed into the Gen1 heap when they are really just waiting to be finalized.

[1] &lt;a href="http://blogs.msdn.com/maoni/archive/2004/06/15/156626.aspx" rel="nofollow"&gt;Using GC Efficiently – Part 1&lt;/a&gt;
    &lt;a href="http://blogs.msdn.com/maoni/archive/2004/09/25/234273.aspx" rel="nofollow"&gt;Using GC Efficiently – Part 2&lt;/a&gt;
    &lt;a href="http://blogs.msdn.com/maoni/archive/2004/12/19/327149.aspx" rel="nofollow"&gt;Using GC Efficiently – Part 3&lt;/a&gt;

[2] &lt;a href="http://blogs.msdn.com/maoni/archive/2004/11/04/252697.aspx" rel="nofollow"&gt;Clearing up some confusion over finalization and other areas in GC&lt;/a&gt;</description>
		<content:encoded><![CDATA[<p>Actually, it&#8217;s not so much a matter of hacking the code to make the assembly nice. It&#8217;s a difference in how various types are treated.</p>
<p>Value types in .NET are a copy-based mechanism, much like they are in other languages. Anything declared as a struct in C# is a value type. Making it a class will not solve the problem though (actually, it makes it worse, but I&#8217;ll detail why in a bit). Since value types are primarily a copy mechanism (the old immutable idea), you&#8217;ve got to account for that in your code. Things like taking a reference to the value type (passing it by ref or as an out parameter) let the CLR see how you are attempting to use these objects, enabling it to eliminate redundant copies and the like. The JIT has very little time to run in, so it&#8217;s not always going to make the best decisions (it does make a lot of smart ones though).</p>
<p>Classes don&#8217;t solve the problem because they are heap only. You cannot allocate a non-heap-based class, so allocating a class type will eventually lead to a GC. Now, Gen0 collections are INSANELY fast, but if you&#8217;re pumping out a lot of short-lived temporaries, some objects will live a bit longer and get pushed into Gen1. Then when Gen1 gets full, it will also be collected, and those short-lived temps will be released, but any that live just a bit longer could end up being pushed up into Gen2. Gen2 takes a long time to free up, especially with the LOH being up there in Gen2 as well. So obviously short-lived temporaries can cost you&#8230;if you&#8217;re not careful.</p>
<p>But it&#8217;s not all horror stories: a Gen0 collection is INSANELY fast, as I mentioned above. A wee bit of profiling on my part found it to be faster than a C++ allocation on a moderately fragmented heap. Since most C++ allocators use a heap walk to find a free chunk of memory to allocate (this is unspecified in the standard, so not all implementations have to behave this way), that traversal can become quite expensive. With managed languages, however, an allocation is a constant-time operation. The GC will typically do a sweep and compact when a Gen2 or Gen1 collection happens, but in general, Gen0 collections rarely have such actions. Plus, the GC typically doesn&#8217;t have to freeze your application[1].</p>
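<p>The constant-time allocation claim is easiest to see with a bump-pointer arena, which is the usual way a compacting collector hands out memory from its youngest generation. The sketch below is a C++ illustration of that idea only; the <code>BumpArena</code> class is invented for this example and is not how the CLR is actually implemented. Where a real collector would trigger a collection when the arena fills up, this sketch just returns <code>nullptr</code>.</p>

```cpp
#include <cstddef>

// Fixed-size arena with bump-pointer allocation. The caller supplies the
// buffer, which is assumed to be aligned at least as strictly as any
// alignment later requested from allocate().
class BumpArena {
public:
    BumpArena(unsigned char* buffer, std::size_t size)
        : base_(buffer), size_(size), used_(0) {}

    // O(1): round the cursor up to the alignment, check for space, advance.
    // No free list is walked. `align` must be a power of two.
    void* allocate(std::size_t bytes,
                   std::size_t align = alignof(std::max_align_t)) {
        std::size_t start = (used_ + align - 1) & ~(align - 1);
        if (start > size_ || bytes > size_ - start) return nullptr;  // arena full
        used_ = start + bytes;
        return base_ + start;
    }

    // Releases everything at once, loosely analogous to the arena being
    // empty again after a collection has moved the survivors elsewhere.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }

private:
    unsigned char* base_;
    std::size_t size_;
    std::size_t used_;
};
```

<p>Compare this with a general-purpose allocator that has to search for a suitably sized free block: on a fragmented heap that search is where the time goes, whereas here the cost is the same no matter how many objects have already been allocated.</p>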
<p>You have to watch out for finalizers though. Since destruction is not a deterministic action, finalizers can cause short-lived elements to be pushed into the Gen1 heap when they are really just waiting to be finalized.</p>
<p>[1] <a href="http://blogs.msdn.com/maoni/archive/2004/06/15/156626.aspx" rel="nofollow">Using GC Efficiently – Part 1</a><br />
    <a href="http://blogs.msdn.com/maoni/archive/2004/09/25/234273.aspx" rel="nofollow">Using GC Efficiently – Part 2</a><br />
    <a href="http://blogs.msdn.com/maoni/archive/2004/12/19/327149.aspx" rel="nofollow">Using GC Efficiently – Part 3</a></p>
<p>[2] <a href="http://blogs.msdn.com/maoni/archive/2004/11/04/252697.aspx" rel="nofollow">Clearing up some confusion over finalization and other areas in GC</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mick West</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/#comment-38</link>
		<dc:creator>Mick West</dc:creator>
		<pubDate>Thu, 08 Feb 2007 22:21:53 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-38</guid>
		<description>Now that I've seen your whole code (nice comparison BTW), it's still 10% longer than the non-SIMD version, plus it calls functions for Magnitude, adding and subtracting, whereas the C++ version just calls sqrtf, etc.  I don't know what's in those functions, but I highly suspect at least another ten lines of asm per function, possibly more.

But how fast does it run?</description>
		<content:encoded><![CDATA[<p>Now that I&#8217;ve seen your whole code (nice comparison BTW), it&#8217;s still 10% longer than the non-SIMD version, plus it calls functions for Magnitude, adding and subtracting, whereas the C++ version just calls sqrtf, etc.  I don&#8217;t know what&#8217;s in those functions, but I highly suspect at least another ten lines of asm per function, possibly more.</p>
<p>But how fast does it run?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mick West</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/#comment-37</link>
		<dc:creator>Mick West</dc:creator>
		<pubDate>Thu, 08 Feb 2007 22:13:49 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-37</guid>
		<description>I take it your code got chopped off at the &#60;, maybe from not closing both your &#60;code&#62; tags?

But I get your point, you can massage the code using what are essentially &lt;a href="http://en.wikipedia.org/wiki/Intrinsic_function" rel="nofollow"&gt;intrinsics&lt;/a&gt;, to get the JIT compiler to spit out code that equals or betters the performance of native compiled C++ code (at least in a straightforward situation like this).

There are a number of calls in your code, which helps make it shorter.  Of course size is not so important here - it's how fast it runs.

And it seems a little counterintuitive to have to hack away at the code to massage the resultant assembly.</description>
		<content:encoded><![CDATA[<p>I take it your code got chopped off at the &lt;, maybe from not closing both your &lt;code&gt; tags?</p>
<p>But I get your point, you can massage the code using what are essentially <a href="http://en.wikipedia.org/wiki/Intrinsic_function" rel="nofollow">intrinsics</a>, to get the JIT compiler to spit out code that equals or betters the performance of native compiled C++ code (at least in a straightforward situation like this).</p>
<p>There are a number of calls in your code, which helps make it shorter.  Of course size is not so important here - it&#8217;s how fast it runs.</p>
<p>And it seems a little counterintuitive to have to hack away at the code to massage the resultant assembly.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Washu</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/#comment-36</link>
		<dc:creator>Washu</dc:creator>
		<pubDate>Thu, 08 Feb 2007 21:57:02 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-36</guid>
		<description>I do apologize for the spam. It would appear that there is a limit to the length of a reply I can send, so I've moved the comparisons off to &lt;a href="http://www.scapecode.com/stuff/ComparisonOfManagedOptimizations.html" rel="nofollow"&gt;a file on my site&lt;/a&gt;.
It’s a bit of both, in reality. The CLR can’t yet take good advantage of SIMD instructions, so that’s obviously a ding against it. But the C++/CLI compiler is not the brightest when doing an unmanaged -&#62; managed conversion.

The file above contains a naïve implementation of your function, on the left. It is significantly shorter than the one above that the C++/CLI compiler generated, though still not that great. So what kinds of problems are there? Well, there is a lot of excessive copying. Operators like - generate temporaries. So if we instead write it the way a managed-code developer in a performance-sensitive environment would, without completely sacrificing readability, we get the version you can see on the right.

Mind you, I could get the right one down to about the same size as yours, without SIMD benefits, but it sacrifices readability.</description>
		<content:encoded><![CDATA[<p>I do apologize for the spam. It would appear that there is a limit to the length of a reply I can send, so I&#8217;ve moved the comparisons off to <a href="http://www.scapecode.com/stuff/ComparisonOfManagedOptimizations.html" rel="nofollow">a file on my site</a>.<br />
It’s a bit of both, in reality. The CLR can’t yet take good advantage of SIMD instructions, so that’s obviously a ding against it. But the C++/CLI compiler is not the brightest when doing an unmanaged -&gt; managed conversion.</p>
<p>The file above contains a naïve implementation of your function, on the left. It is significantly shorter than the one above that the C++/CLI compiler generated, though still not that great. So what kinds of problems are there? Well, there is a lot of excessive copying. Operators like - generate temporaries. So if we instead write it the way a managed-code developer in a performance-sensitive environment would, without completely sacrificing readability, we get the version you can see on the right.</p>
<p>Mind you, I could get the right one down to about the same size as yours, without SIMD benefits, but it sacrifices readability.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mick West</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/#comment-32</link>
		<dc:creator>Mick West</dc:creator>
		<pubDate>Thu, 08 Feb 2007 19:27:44 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-32</guid>
		<description>Okay, I feel that the above shows that the iteration overhead is not as great as I might have thought.   Now, much of my code is doing floating point vector calculations like this:

&lt;code&gt;
Vector2	CSemiRigidConstraint::GetForce(CVerletPoint* p_verlet)
{
	Vector2	to_me = p_verlet-&gt;GetPos() - mp_other_verlet-&gt;GetPos();
	if (to_me.Length() &lt; 0.000001)
	{
		to_me = Vector2(1.0f,0.0f);
	}
	Vector2	mid = mp_other_verlet-&gt;GetPos() + to_me.Normal()*m_mid;
	Vector2	to_mid = mid-p_verlet-&gt;GetPos() ;
	return to_mid*m_force; 
}
&lt;/code&gt;

Which in my unmanaged version compiles as:
&lt;code&gt;
Vector2	CSemiRigidConstraint::GetForce(CVerletPoint* p_verlet)
{
004062F0  push        ebp  
004062F1  mov         ebp,esp 
004062F3  and         esp,0FFFFFFF8h 
	Vector2	to_me = p_verlet-&gt;GetPos() - mp_other_verlet-&gt;GetPos();
004062F6  mov         edx,dword ptr [ebp+0Ch] 
004062F9  mov         eax,dword ptr [ecx+4] 
004062FC  movss       xmm0,dword ptr [eax] 
00406300  movss       xmm1,dword ptr [eax+4] 
00406305  movss       xmm4,dword ptr [edx] 
00406309  movss       xmm5,dword ptr [edx+4] 
	if (to_me.Length() &lt; 0.000001)
0040630E  xorps       xmm6,xmm6 
00406311  movaps      xmm2,xmm4 
00406314  subss       xmm2,xmm0 
00406318  movaps      xmm3,xmm5 
0040631B  subss       xmm3,xmm1 
0040631F  movaps      xmm0,xmm3 
00406322  mulss       xmm0,xmm3 
00406326  movaps      xmm1,xmm2 
00406329  mulss       xmm1,xmm2 
0040632D  addss       xmm0,xmm1 
00406331  movsd       xmm1,mmword ptr [__real@3eb0c6f7a0b5ed8d (418060h)] 
00406339  sqrtss      xmm0,xmm0 
0040633D  sub         esp,8 
00406340  cvtps2pd    xmm0,xmm0 
00406343  comisd      xmm1,xmm0 
00406347  movss       xmm1,dword ptr [__real@3f800000 (41803Ch)] 
0040634F  jbe         CSemiRigidConstraint::GetForce+67h (406357h) 
	{
		to_me = Vector2(1.0f,0.0f);
00406351  movaps      xmm2,xmm1 
00406354  movaps      xmm3,xmm6 
	}
	Vector2	mid = mp_other_verlet-&gt;GetPos() + to_me.Normal()*m_mid;
00406357  movaps      xmm0,xmm3 
0040635A  mulss       xmm0,xmm3 
0040635E  movaps      xmm7,xmm2 
00406361  mulss       xmm7,xmm2 
00406365  addss       xmm0,xmm7 
00406369  sqrtss      xmm0,xmm0 
0040636D  comiss      xmm0,xmm6 
00406370  jbe         CSemiRigidConstraint::GetForce+93h (406383h) 
00406372  divss       xmm1,xmm0 
00406376  movaps      xmm0,xmm1 
00406379  mulss       xmm0,xmm2 
0040637D  mulss       xmm1,xmm3 
00406381  jmp         CSemiRigidConstraint::GetForce+9Eh (40638Eh) 
00406383  movss       xmm1,dword ptr [esp+4] 
00406389  movss       xmm0,dword ptr [esp] 
0040638E  movss       xmm2,dword ptr [ecx+0Ch] 
00406393  mulss       xmm1,xmm2 
00406397  mulss       xmm0,xmm2 
0040639B  movss       xmm2,dword ptr [eax+4] 
004063A0  movaps      xmm3,xmm1 
004063A3  movss       xmm1,dword ptr [eax] 
	Vector2	to_mid = mid-p_verlet-&gt;GetPos() ;
	return to_mid*m_force; 
004063A7  mov         eax,dword ptr [ebp+8] 
004063AA  addss       xmm1,xmm0 
004063AE  addss       xmm2,xmm3 
004063B2  movaps      xmm0,xmm4 
004063B5  subss       xmm1,xmm0 
004063B9  movss       xmm0,dword ptr [ecx+14h] 
004063BE  movaps      xmm3,xmm5 
004063C1  subss       xmm2,xmm3 
004063C5  mulss       xmm1,xmm0 
004063C9  mulss       xmm2,xmm0 
004063CD  movss       dword ptr [eax],xmm1 
004063D1  movss       dword ptr [eax+4],xmm2 
}
004063D6  mov         esp,ebp 
004063D8  pop         ebp  
004063D9  ret         8
&lt;/code&gt;

Now, that's using SIMD extensions, which we can't use under the CLR, so let's turn them off:
&lt;code&gt;
Vector2	CSemiRigidConstraint::GetForce(CVerletPoint* p_verlet)
{
	Vector2	to_me = p_verlet-&gt;GetPos() - mp_other_verlet-&gt;GetPos();
00405860  mov         edx,dword ptr [ecx+4] 
00405863  fld         dword ptr [edx] 
00405865  sub         esp,8 
00405868  fld         dword ptr [edx+4] 
0040586B  push        esi  
0040586C  mov         esi,dword ptr [esp+14h] 
00405870  fld         dword ptr [esi] 
00405872  fsubrp      st(2),st 
00405874  fsubr       dword ptr [esi+4] 
	if (to_me.Length() &lt; 0.000001)
00405877  fld         st(0) 
00405879  fmul        st,st(1) 
0040587B  fld         st(2) 
0040587D  fmul        st,st(3) 
0040587F  faddp       st(1),st 
00405881  fsqrt            
00405883  fcomp       qword ptr [__real@3eb0c6f7a0b5ed8d (414508h)] 
00405889  fnstsw      ax   
0040588B  fld1             
0040588D  test        ah,5 
00405890  fldz             
00405892  jp          CSemiRigidConstraint::GetForce+46h (4058A6h) 
	{
		to_me = Vector2(1.0f,0.0f);
00405894  fstp        st(3) 
00405896  fstp        st(1) 
00405898  fld         st(0) 
0040589A  fld         st(2) 
0040589C  fxch        st(1) 
0040589E  fxch        st(3) 
004058A0  fxch        st(1) 
004058A2  fxch        st(2) 
004058A4  fxch        st(1) 
	}
	Vector2	mid = mp_other_verlet-&gt;GetPos() + to_me.Normal()*m_mid;
004058A6  fld         st(2) 
004058A8  fmul        st,st(3) 
004058AA  fld         st(4) 
004058AC  fmul        st,st(5) 
004058AE  faddp       st(1),st 
004058B0  fsqrt            
004058B2  fcom        st(1) 
004058B4  fnstsw      ax   
004058B6  fstp        st(1) 
004058B8  test        ah,41h 
004058BB  jne         CSemiRigidConstraint::GetForce+67h (4058C7h) 
004058BD  fdivp       st(1),st 
004058BF  fld         st(0) 
004058C1  fmulp       st(3),st 
004058C3  fmulp       st(1),st 
004058C5  jmp         CSemiRigidConstraint::GetForce+77h (4058D7h) 
004058C7  fstp        st(0) 
004058C9  fstp        st(2) 
004058CB  fstp        st(0) 
004058CD  fstp        st(0) 
004058CF  fld         dword ptr [esp+4] 
004058D3  fld         dword ptr [esp+8] 
004058D7  fld         dword ptr [ecx+0Ch] 
	Vector2	to_mid = mid-p_verlet-&gt;GetPos() ;
	return to_mid*m_force; 
004058DA  mov         eax,dword ptr [esp+10h] 
004058DE  fmul        st(2),st 
004058E0  fmulp       st(1),st 
004058E2  fld         dword ptr [edx] 
004058E4  faddp       st(2),st 
004058E6  fadd        dword ptr [edx+4] 
004058E9  fld         dword ptr [esi] 
004058EB  fld         dword ptr [esi+4] 
004058EE  pop         esi  
004058EF  fxch        st(3) 
004058F1  fsubrp      st(1),st 
004058F3  fxch        st(1) 
004058F5  fsubrp      st(2),st 
004058F7  fld         dword ptr [ecx+14h] 
004058FA  fmul        st(1),st 
004058FC  fxch        st(1) 
004058FE  fstp        dword ptr [eax] 
00405900  fmulp       st(1),st 
00405902  fstp        dword ptr [eax+4] 
}
00405905  add         esp,8 
00405908  ret         8 
&lt;/code&gt;
The size is similar; the floating-point math just runs on the FP stack rather than in SIMD registers.

Now look at the /clr version:
&lt;code&gt;
Vector2	CSemiRigidConstraint::GetForce(CVerletPoint* p_verlet)
{
	Vector2	to_me = p_verlet-&gt;GetPos() - mp_other_verlet-&gt;GetPos();
00000000  push        ebp  
00000001  mov         ebp,esp 
00000003  push        edi  
00000004  push        esi  
00000005  push        ebx  
00000006  sub         esp,94h 
0000000c  mov         esi,ecx 
0000000e  mov         edi,edx 
00000010  cmp         dword ptr ds:[006C2DC8h],0 
00000017  je          0000001E 
00000019  call        78DE2926 
0000001e  fldz             
00000020  fstp        dword ptr [ebp-10h] 
00000023  xor         ebx,ebx 
00000025  fldz             
00000027  fstp        dword ptr [ebp-14h] 
0000002a  fldz             
0000002c  fstp        dword ptr [ebp-18h] 
0000002f  fldz             
00000031  fstp        dword ptr [ebp-1Ch] 
00000034  fldz             
00000036  fstp        dword ptr [ebp-20h] 
00000039  fldz             
0000003b  fstp        dword ptr [ebp-24h] 
0000003e  xor         edx,edx 
00000040  mov         dword ptr [ebp-28h],edx 
00000043  xor         edx,edx 
00000045  mov         dword ptr [ebp-2Ch],edx 
00000048  fldz             
0000004a  fstp        dword ptr [ebp-30h] 
0000004d  fldz             
0000004f  fstp        dword ptr [ebp-34h] 
00000052  fldz             
00000054  fstp        dword ptr [ebp-38h] 
00000057  fldz             
00000059  fstp        dword ptr [ebp-3Ch] 
0000005c  fldz             
0000005e  fstp        dword ptr [ebp-40h] 
00000061  fldz             
00000063  fstp        dword ptr [ebp-44h] 
00000066  fldz             
00000068  fstp        dword ptr [ebp-48h] 
0000006b  fldz             
0000006d  fstp        dword ptr [ebp-4Ch] 
00000070  fldz             
00000072  fstp        dword ptr [ebp-50h] 
00000075  fldz             
00000077  fstp        dword ptr [ebp-54h] 
0000007a  mov         eax,dword ptr [esi+4] 
0000007d  mov         dword ptr [ebp-2Ch],eax 
00000080  mov         eax,dword ptr [ebp-2Ch] 
00000083  mov         dword ptr [ebp-28h],eax 
00000086  mov         eax,dword ptr [ebp-28h] 
00000089  fld         dword ptr [eax] 
0000008b  fstp        dword ptr [ebp+FFFFFF7Ch] 
00000091  mov         eax,dword ptr [ebp-28h] 
00000094  fld         dword ptr [eax+4] 
00000097  fstp        dword ptr [ebp-80h] 
0000009a  mov         eax,dword ptr [ebp+8] 
0000009d  fld         dword ptr [eax] 
0000009f  fstp        dword ptr [ebp-7Ch] 
000000a2  mov         eax,dword ptr [ebp+8] 
000000a5  fld         dword ptr [eax+4] 
000000a8  fstp        dword ptr [ebp-78h] 
000000ab  fld         dword ptr [ebp-7Ch] 
000000ae  fsub        dword ptr [ebp+FFFFFF7Ch] 
000000b4  fstp        dword ptr [ebp-24h] 
000000b7  fld         dword ptr [ebp-78h] 
000000ba  fsub        dword ptr [ebp-80h] 
000000bd  fstp        dword ptr [ebp-20h] 
000000c0  fld         dword ptr [ebp-24h] 
000000c3  fstp        dword ptr [ebp-74h] 
000000c6  fld         dword ptr [ebp-20h] 
000000c9  fstp        dword ptr [ebp-70h] 
	if (to_me.Length() &lt; 0.000001)
000000cc  fld         dword ptr [ebp-20h] 
000000cf  fmul        st,st(0) 
000000d1  fld         dword ptr [ebp-24h] 
000000d4  fmul        st,st(0) 
000000d6  faddp       st(1),st 
000000d8  fstp        dword ptr [ebp-54h] 
000000db  fld         dword ptr [ebp-54h] 
000000de  fsqrt            
000000e0  fstp        qword ptr [ebp+FFFFFF74h] 
000000e6  fld         qword ptr [ebp+FFFFFF74h] 
000000ec  fstp        dword ptr [ebp+FFFFFF68h] 
000000f2  fld         dword ptr [ebp+FFFFFF68h] 
000000f8  fstp        qword ptr [ebp+FFFFFF60h] 
000000fe  fld         qword ptr [ebp+FFFFFF60h] 
00000104  fld         qword ptr ds:[012AFBD0h] 
0000010a  fcomip      st,st(1) 
0000010c  fstp        st(0) 
0000010e  jp          0000011C 
00000110  jbe         0000011C 
	{
		to_me = Vector2(1.0f,0.0f);
00000112  fld1             
00000114  fstp        dword ptr [ebp-74h] 
00000117  fldz             
00000119  fstp        dword ptr [ebp-70h] 
	}
	Vector2	mid = mp_other_verlet-&gt;GetPos() + to_me.Normal()*m_mid;
0000011c  fld         dword ptr [ebp-70h] 
0000011f  fmul        st,st(0) 
00000121  fld         dword ptr [ebp-74h] 
00000124  fmul        st,st(0) 
00000126  faddp       st(1),st 
00000128  fstp        dword ptr [ebp-50h] 
0000012b  fld         dword ptr [ebp-50h] 
0000012e  fsqrt            
00000130  fstp        qword ptr [ebp+FFFFFF6Ch] 
00000136  fld         qword ptr [ebp+FFFFFF6Ch] 
0000013c  fstp        dword ptr [ebp-1Ch] 
0000013f  fld         dword ptr [ebp-1Ch] 
00000142  fldz             
00000144  fcomip      st,st(1) 
00000146  fstp        st(0) 
00000148  jp          00000168 
0000014a  jae         00000168 
0000014c  fld         dword ptr [ebp-1Ch] 
0000014f  fld1             
00000151  fdivrp      st(1),st 
00000153  fstp        dword ptr [ebp-18h] 
00000156  fld         dword ptr [ebp-18h] 
00000159  fmul        dword ptr [ebp-74h] 
0000015c  fstp        dword ptr [ebp-6Ch] 
0000015f  fld         dword ptr [ebp-18h] 
00000162  fmul        dword ptr [ebp-70h] 
00000165  fstp        dword ptr [ebp-68h] 
00000168  fld         dword ptr [esi+0Ch] 
0000016b  fstp        dword ptr [ebp-14h] 
0000016e  fld         dword ptr [ebp-6Ch] 
00000171  fmul        dword ptr [ebp-14h] 
00000174  fstp        dword ptr [ebp-4Ch] 
00000177  fld         dword ptr [ebp-68h] 
0000017a  fmul        dword ptr [ebp-14h] 
0000017d  fstp        dword ptr [ebp-48h] 
00000180  mov         ebx,dword ptr [ebp-2Ch] 
00000183  fld         dword ptr [ebx] 
00000185  fstp        dword ptr [ebp-64h] 
00000188  fld         dword ptr [ebx+4] 
0000018b  fstp        dword ptr [ebp-60h] 
0000018e  fld         dword ptr [ebp-64h] 
00000191  fadd        dword ptr [ebp-4Ch] 
00000194  fstp        dword ptr [ebp-44h] 
00000197  fld         dword ptr [ebp-60h] 
0000019a  fadd        dword ptr [ebp-48h] 
0000019d  fstp        dword ptr [ebp-40h] 
	Vector2	to_mid = mid-p_verlet-&gt;GetPos() ;
000001a0  mov         eax,dword ptr [ebp+8] 
000001a3  fld         dword ptr [eax] 
000001a5  fstp        dword ptr [ebp-5Ch] 
000001a8  mov         eax,dword ptr [ebp+8] 
000001ab  fld         dword ptr [eax+4] 
000001ae  fstp        dword ptr [ebp-58h] 
000001b1  fld         dword ptr [ebp-44h] 
000001b4  fsub        dword ptr [ebp-5Ch] 
000001b7  fstp        dword ptr [ebp-3Ch] 
000001ba  fld         dword ptr [ebp-40h] 
000001bd  fsub        dword ptr [ebp-58h] 
000001c0  fstp        dword ptr [ebp-38h] 
	return to_mid*m_force;  //
000001c3  fld         dword ptr [esi+14h] 
000001c6  fstp        dword ptr [ebp-10h] 
000001c9  fld         dword ptr [ebp-3Ch] 
000001cc  fmul        dword ptr [ebp-10h] 
000001cf  fstp        dword ptr [ebp-34h] 
000001d2  fld         dword ptr [ebp-38h] 
000001d5  fmul        dword ptr [ebp-10h] 
000001d8  fstp        dword ptr [ebp-30h] 
000001db  fld         dword ptr [ebp-34h] 
000001de  fstp        dword ptr [edi] 
000001e0  fld         dword ptr [ebp-30h] 
000001e3  fstp        dword ptr [edi+4] 
000001e6  mov         eax,edi 
000001e8  lea         esp,[ebp-0Ch] 
000001eb  pop         ebx  
000001ec  pop         esi  
000001ed  pop         edi  
000001ee  pop         ebp  
000001ef  ret         4   
&lt;/code&gt;

Even at a glance, it looks like it would take twice as long, possibly more depending on the memory cache situation. Is this just bad CLR compilation, or am I doing something wrong?</description>
		<content:encoded><![CDATA[<p>Okay, I feel that the above shows that the iteration overhead is not as great as I might have thought. Now, much of my code is doing floating-point vector calculations like this:</p>
<p><code><br />
Vector2	CSemiRigidConstraint::GetForce(CVerletPoint* p_verlet)<br />
{<br />
	Vector2	to_me = p_verlet->GetPos() - mp_other_verlet->GetPos();<br />
	if (to_me.Length() < 0.000001)<br />
	{<br />
		to_me = Vector2(1.0f,0.0f);<br />
	}<br />
	Vector2	mid = mp_other_verlet->GetPos() + to_me.Normal()*m_mid;<br />
	Vector2	to_mid = mid-p_verlet->GetPos() ;<br />
	return to_mid*m_force;<br />
}<br />
</code></p>
<p>In my unmanaged version, that compiles to:<br />
<code><br />
Vector2	CSemiRigidConstraint::GetForce(CVerletPoint* p_verlet)<br />
{<br />
004062F0  push        ebp<br />
004062F1  mov         ebp,esp<br />
004062F3  and         esp,0FFFFFFF8h<br />
	Vector2	to_me = p_verlet->GetPos() - mp_other_verlet->GetPos();<br />
004062F6  mov         edx,dword ptr [ebp+0Ch]<br />
004062F9  mov         eax,dword ptr [ecx+4]<br />
004062FC  movss       xmm0,dword ptr [eax]<br />
00406300  movss       xmm1,dword ptr [eax+4]<br />
00406305  movss       xmm4,dword ptr [edx]<br />
00406309  movss       xmm5,dword ptr [edx+4]<br />
	if (to_me.Length() < 0.000001)<br />
0040630E  xorps       xmm6,xmm6<br />
00406311  movaps      xmm2,xmm4<br />
00406314  subss       xmm2,xmm0<br />
00406318  movaps      xmm3,xmm5<br />
0040631B  subss       xmm3,xmm1<br />
0040631F  movaps      xmm0,xmm3<br />
00406322  mulss       xmm0,xmm3<br />
00406326  movaps      xmm1,xmm2<br />
00406329  mulss       xmm1,xmm2<br />
0040632D  addss       xmm0,xmm1<br />
00406331  movsd       xmm1,mmword ptr [__real@3eb0c6f7a0b5ed8d (418060h)]<br />
00406339  sqrtss      xmm0,xmm0<br />
0040633D  sub         esp,8<br />
00406340  cvtps2pd    xmm0,xmm0<br />
00406343  comisd      xmm1,xmm0<br />
00406347  movss       xmm1,dword ptr [__real@3f800000 (41803Ch)]<br />
0040634F  jbe         CSemiRigidConstraint::GetForce+67h (406357h)<br />
	{<br />
		to_me = Vector2(1.0f,0.0f);<br />
00406351  movaps      xmm2,xmm1<br />
00406354  movaps      xmm3,xmm6<br />
	}<br />
	Vector2	mid = mp_other_verlet->GetPos() + to_me.Normal()*m_mid;<br />
00406357  movaps      xmm0,xmm3<br />
0040635A  mulss       xmm0,xmm3<br />
0040635E  movaps      xmm7,xmm2<br />
00406361  mulss       xmm7,xmm2<br />
00406365  addss       xmm0,xmm7<br />
00406369  sqrtss      xmm0,xmm0<br />
0040636D  comiss      xmm0,xmm6<br />
00406370  jbe         CSemiRigidConstraint::GetForce+93h (406383h)<br />
00406372  divss       xmm1,xmm0<br />
00406376  movaps      xmm0,xmm1<br />
00406379  mulss       xmm0,xmm2<br />
0040637D  mulss       xmm1,xmm3<br />
00406381  jmp         CSemiRigidConstraint::GetForce+9Eh (40638Eh)<br />
00406383  movss       xmm1,dword ptr [esp+4]<br />
00406389  movss       xmm0,dword ptr [esp]<br />
0040638E  movss       xmm2,dword ptr [ecx+0Ch]<br />
00406393  mulss       xmm1,xmm2<br />
00406397  mulss       xmm0,xmm2<br />
0040639B  movss       xmm2,dword ptr [eax+4]<br />
004063A0  movaps      xmm3,xmm1<br />
004063A3  movss       xmm1,dword ptr [eax]<br />
	Vector2	to_mid = mid-p_verlet->GetPos() ;<br />
	return to_mid*m_force;<br />
004063A7  mov         eax,dword ptr [ebp+8]<br />
004063AA  addss       xmm1,xmm0<br />
004063AE  addss       xmm2,xmm3<br />
004063B2  movaps      xmm0,xmm4<br />
004063B5  subss       xmm1,xmm0<br />
004063B9  movss       xmm0,dword ptr [ecx+14h]<br />
004063BE  movaps      xmm3,xmm5<br />
004063C1  subss       xmm2,xmm3<br />
004063C5  mulss       xmm1,xmm0<br />
004063C9  mulss       xmm2,xmm0<br />
004063CD  movss       dword ptr [eax],xmm1<br />
004063D1  movss       dword ptr [eax+4],xmm2<br />
}<br />
004063D6  mov         esp,ebp<br />
004063D8  pop         ebp<br />
004063D9  ret         8<br />
</code></p>
<p>Now, that&#8217;s using SIMD extensions, which we can&#8217;t use under the CLR, so let&#8217;s turn them off:<br />
<code><br />
Vector2	CSemiRigidConstraint::GetForce(CVerletPoint* p_verlet)<br />
{<br />
	Vector2	to_me = p_verlet->GetPos() - mp_other_verlet->GetPos();<br />
00405860  mov         edx,dword ptr [ecx+4]<br />
00405863  fld         dword ptr [edx]<br />
00405865  sub         esp,8<br />
00405868  fld         dword ptr [edx+4]<br />
0040586B  push        esi<br />
0040586C  mov         esi,dword ptr [esp+14h]<br />
00405870  fld         dword ptr [esi]<br />
00405872  fsubrp      st(2),st<br />
00405874  fsubr       dword ptr [esi+4]<br />
	if (to_me.Length() < 0.000001)<br />
00405877  fld         st(0)<br />
00405879  fmul        st,st(1)<br />
0040587B  fld         st(2)<br />
0040587D  fmul        st,st(3)<br />
0040587F  faddp       st(1),st<br />
00405881  fsqrt<br />
00405883  fcomp       qword ptr [__real@3eb0c6f7a0b5ed8d (414508h)]<br />
00405889  fnstsw      ax<br />
0040588B  fld1<br />
0040588D  test        ah,5<br />
00405890  fldz<br />
00405892  jp          CSemiRigidConstraint::GetForce+46h (4058A6h)<br />
	{<br />
		to_me = Vector2(1.0f,0.0f);<br />
00405894  fstp        st(3)<br />
00405896  fstp        st(1)<br />
00405898  fld         st(0)<br />
0040589A  fld         st(2)<br />
0040589C  fxch        st(1)<br />
0040589E  fxch        st(3)<br />
004058A0  fxch        st(1)<br />
004058A2  fxch        st(2)<br />
004058A4  fxch        st(1)<br />
	}<br />
	Vector2	mid = mp_other_verlet->GetPos() + to_me.Normal()*m_mid;<br />
004058A6  fld         st(2)<br />
004058A8  fmul        st,st(3)<br />
004058AA  fld         st(4)<br />
004058AC  fmul        st,st(5)<br />
004058AE  faddp       st(1),st<br />
004058B0  fsqrt<br />
004058B2  fcom        st(1)<br />
004058B4  fnstsw      ax<br />
004058B6  fstp        st(1)<br />
004058B8  test        ah,41h<br />
004058BB  jne         CSemiRigidConstraint::GetForce+67h (4058C7h)<br />
004058BD  fdivp       st(1),st<br />
004058BF  fld         st(0)<br />
004058C1  fmulp       st(3),st<br />
004058C3  fmulp       st(1),st<br />
004058C5  jmp         CSemiRigidConstraint::GetForce+77h (4058D7h)<br />
004058C7  fstp        st(0)<br />
004058C9  fstp        st(2)<br />
004058CB  fstp        st(0)<br />
004058CD  fstp        st(0)<br />
004058CF  fld         dword ptr [esp+4]<br />
004058D3  fld         dword ptr [esp+8]<br />
004058D7  fld         dword ptr [ecx+0Ch]<br />
	Vector2	to_mid = mid-p_verlet->GetPos() ;<br />
	return to_mid*m_force;<br />
004058DA  mov         eax,dword ptr [esp+10h]<br />
004058DE  fmul        st(2),st<br />
004058E0  fmulp       st(1),st<br />
004058E2  fld         dword ptr [edx]<br />
004058E4  faddp       st(2),st<br />
004058E6  fadd        dword ptr [edx+4]<br />
004058E9  fld         dword ptr [esi]<br />
004058EB  fld         dword ptr [esi+4]<br />
004058EE  pop         esi<br />
004058EF  fxch        st(3)<br />
004058F1  fsubrp      st(1),st<br />
004058F3  fxch        st(1)<br />
004058F5  fsubrp      st(2),st<br />
004058F7  fld         dword ptr [ecx+14h]<br />
004058FA  fmul        st(1),st<br />
004058FC  fxch        st(1)<br />
004058FE  fstp        dword ptr [eax]<br />
00405900  fmulp       st(1),st<br />
00405902  fstp        dword ptr [eax+4]<br />
}<br />
00405905  add         esp,8<br />
00405908  ret         8<br />
</code><br />
The size is similar; the floating-point math just runs on the FP stack rather than in SIMD registers.</p>
<p>Now look at the /clr version:<br />
<code><br />
Vector2	CSemiRigidConstraint::GetForce(CVerletPoint* p_verlet)<br />
{<br />
	Vector2	to_me = p_verlet->GetPos() - mp_other_verlet->GetPos();<br />
00000000  push        ebp<br />
00000001  mov         ebp,esp<br />
00000003  push        edi<br />
00000004  push        esi<br />
00000005  push        ebx<br />
00000006  sub         esp,94h<br />
0000000c  mov         esi,ecx<br />
0000000e  mov         edi,edx<br />
00000010  cmp         dword ptr ds:[006C2DC8h],0<br />
00000017  je          0000001E<br />
00000019  call        78DE2926<br />
0000001e  fldz<br />
00000020  fstp        dword ptr [ebp-10h]<br />
00000023  xor         ebx,ebx<br />
00000025  fldz<br />
00000027  fstp        dword ptr [ebp-14h]<br />
0000002a  fldz<br />
0000002c  fstp        dword ptr [ebp-18h]<br />
0000002f  fldz<br />
00000031  fstp        dword ptr [ebp-1Ch]<br />
00000034  fldz<br />
00000036  fstp        dword ptr [ebp-20h]<br />
00000039  fldz<br />
0000003b  fstp        dword ptr [ebp-24h]<br />
0000003e  xor         edx,edx<br />
00000040  mov         dword ptr [ebp-28h],edx<br />
00000043  xor         edx,edx<br />
00000045  mov         dword ptr [ebp-2Ch],edx<br />
00000048  fldz<br />
0000004a  fstp        dword ptr [ebp-30h]<br />
0000004d  fldz<br />
0000004f  fstp        dword ptr [ebp-34h]<br />
00000052  fldz<br />
00000054  fstp        dword ptr [ebp-38h]<br />
00000057  fldz<br />
00000059  fstp        dword ptr [ebp-3Ch]<br />
0000005c  fldz<br />
0000005e  fstp        dword ptr [ebp-40h]<br />
00000061  fldz<br />
00000063  fstp        dword ptr [ebp-44h]<br />
00000066  fldz<br />
00000068  fstp        dword ptr [ebp-48h]<br />
0000006b  fldz<br />
0000006d  fstp        dword ptr [ebp-4Ch]<br />
00000070  fldz<br />
00000072  fstp        dword ptr [ebp-50h]<br />
00000075  fldz<br />
00000077  fstp        dword ptr [ebp-54h]<br />
0000007a  mov         eax,dword ptr [esi+4]<br />
0000007d  mov         dword ptr [ebp-2Ch],eax<br />
00000080  mov         eax,dword ptr [ebp-2Ch]<br />
00000083  mov         dword ptr [ebp-28h],eax<br />
00000086  mov         eax,dword ptr [ebp-28h]<br />
00000089  fld         dword ptr [eax]<br />
0000008b  fstp        dword ptr [ebp+FFFFFF7Ch]<br />
00000091  mov         eax,dword ptr [ebp-28h]<br />
00000094  fld         dword ptr [eax+4]<br />
00000097  fstp        dword ptr [ebp-80h]<br />
0000009a  mov         eax,dword ptr [ebp+8]<br />
0000009d  fld         dword ptr [eax]<br />
0000009f  fstp        dword ptr [ebp-7Ch]<br />
000000a2  mov         eax,dword ptr [ebp+8]<br />
000000a5  fld         dword ptr [eax+4]<br />
000000a8  fstp        dword ptr [ebp-78h]<br />
000000ab  fld         dword ptr [ebp-7Ch]<br />
000000ae  fsub        dword ptr [ebp+FFFFFF7Ch]<br />
000000b4  fstp        dword ptr [ebp-24h]<br />
000000b7  fld         dword ptr [ebp-78h]<br />
000000ba  fsub        dword ptr [ebp-80h]<br />
000000bd  fstp        dword ptr [ebp-20h]<br />
000000c0  fld         dword ptr [ebp-24h]<br />
000000c3  fstp        dword ptr [ebp-74h]<br />
000000c6  fld         dword ptr [ebp-20h]<br />
000000c9  fstp        dword ptr [ebp-70h]<br />
	if (to_me.Length() < 0.000001)<br />
000000cc  fld         dword ptr [ebp-20h]<br />
000000cf  fmul        st,st(0)<br />
000000d1  fld         dword ptr [ebp-24h]<br />
000000d4  fmul        st,st(0)<br />
000000d6  faddp       st(1),st<br />
000000d8  fstp        dword ptr [ebp-54h]<br />
000000db  fld         dword ptr [ebp-54h]<br />
000000de  fsqrt<br />
000000e0  fstp        qword ptr [ebp+FFFFFF74h]<br />
000000e6  fld         qword ptr [ebp+FFFFFF74h]<br />
000000ec  fstp        dword ptr [ebp+FFFFFF68h]<br />
000000f2  fld         dword ptr [ebp+FFFFFF68h]<br />
000000f8  fstp        qword ptr [ebp+FFFFFF60h]<br />
000000fe  fld         qword ptr [ebp+FFFFFF60h]<br />
00000104  fld         qword ptr ds:[012AFBD0h]<br />
0000010a  fcomip      st,st(1)<br />
0000010c  fstp        st(0)<br />
0000010e  jp          0000011C<br />
00000110  jbe         0000011C<br />
	{<br />
		to_me = Vector2(1.0f,0.0f);<br />
00000112  fld1<br />
00000114  fstp        dword ptr [ebp-74h]<br />
00000117  fldz<br />
00000119  fstp        dword ptr [ebp-70h]<br />
	}<br />
	Vector2	mid = mp_other_verlet->GetPos() + to_me.Normal()*m_mid;<br />
0000011c  fld         dword ptr [ebp-70h]<br />
0000011f  fmul        st,st(0)<br />
00000121  fld         dword ptr [ebp-74h]<br />
00000124  fmul        st,st(0)<br />
00000126  faddp       st(1),st<br />
00000128  fstp        dword ptr [ebp-50h]<br />
0000012b  fld         dword ptr [ebp-50h]<br />
0000012e  fsqrt<br />
00000130  fstp        qword ptr [ebp+FFFFFF6Ch]<br />
00000136  fld         qword ptr [ebp+FFFFFF6Ch]<br />
0000013c  fstp        dword ptr [ebp-1Ch]<br />
0000013f  fld         dword ptr [ebp-1Ch]<br />
00000142  fldz<br />
00000144  fcomip      st,st(1)<br />
00000146  fstp        st(0)<br />
00000148  jp          00000168<br />
0000014a  jae         00000168<br />
0000014c  fld         dword ptr [ebp-1Ch]<br />
0000014f  fld1<br />
00000151  fdivrp      st(1),st<br />
00000153  fstp        dword ptr [ebp-18h]<br />
00000156  fld         dword ptr [ebp-18h]<br />
00000159  fmul        dword ptr [ebp-74h]<br />
0000015c  fstp        dword ptr [ebp-6Ch]<br />
0000015f  fld         dword ptr [ebp-18h]<br />
00000162  fmul        dword ptr [ebp-70h]<br />
00000165  fstp        dword ptr [ebp-68h]<br />
00000168  fld         dword ptr [esi+0Ch]<br />
0000016b  fstp        dword ptr [ebp-14h]<br />
0000016e  fld         dword ptr [ebp-6Ch]<br />
00000171  fmul        dword ptr [ebp-14h]<br />
00000174  fstp        dword ptr [ebp-4Ch]<br />
00000177  fld         dword ptr [ebp-68h]<br />
0000017a  fmul        dword ptr [ebp-14h]<br />
0000017d  fstp        dword ptr [ebp-48h]<br />
00000180  mov         ebx,dword ptr [ebp-2Ch]<br />
00000183  fld         dword ptr [ebx]<br />
00000185  fstp        dword ptr [ebp-64h]<br />
00000188  fld         dword ptr [ebx+4]<br />
0000018b  fstp        dword ptr [ebp-60h]<br />
0000018e  fld         dword ptr [ebp-64h]<br />
00000191  fadd        dword ptr [ebp-4Ch]<br />
00000194  fstp        dword ptr [ebp-44h]<br />
00000197  fld         dword ptr [ebp-60h]<br />
0000019a  fadd        dword ptr [ebp-48h]<br />
0000019d  fstp        dword ptr [ebp-40h]<br />
	Vector2	to_mid = mid-p_verlet->GetPos() ;<br />
000001a0  mov         eax,dword ptr [ebp+8]<br />
000001a3  fld         dword ptr [eax]<br />
000001a5  fstp        dword ptr [ebp-5Ch]<br />
000001a8  mov         eax,dword ptr [ebp+8]<br />
000001ab  fld         dword ptr [eax+4]<br />
000001ae  fstp        dword ptr [ebp-58h]<br />
000001b1  fld         dword ptr [ebp-44h]<br />
000001b4  fsub        dword ptr [ebp-5Ch]<br />
000001b7  fstp        dword ptr [ebp-3Ch]<br />
000001ba  fld         dword ptr [ebp-40h]<br />
000001bd  fsub        dword ptr [ebp-58h]<br />
000001c0  fstp        dword ptr [ebp-38h]<br />
	return to_mid*m_force;  //<br />
000001c3  fld         dword ptr [esi+14h]<br />
000001c6  fstp        dword ptr [ebp-10h]<br />
000001c9  fld         dword ptr [ebp-3Ch]<br />
000001cc  fmul        dword ptr [ebp-10h]<br />
000001cf  fstp        dword ptr [ebp-34h]<br />
000001d2  fld         dword ptr [ebp-38h]<br />
000001d5  fmul        dword ptr [ebp-10h]<br />
000001d8  fstp        dword ptr [ebp-30h]<br />
000001db  fld         dword ptr [ebp-34h]<br />
000001de  fstp        dword ptr [edi]<br />
000001e0  fld         dword ptr [ebp-30h]<br />
000001e3  fstp        dword ptr [edi+4]<br />
000001e6  mov         eax,edi<br />
000001e8  lea         esp,[ebp-0Ch]<br />
000001eb  pop         ebx<br />
000001ec  pop         esi<br />
000001ed  pop         edi<br />
000001ee  pop         ebp<br />
000001ef  ret         4<br />
</code></p>
<p>Even at a glance, it looks like it would take twice as long, possibly more depending on the memory cache situation. Is this just bad CLR compilation, or am I doing something wrong?</p>
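<p>For anyone who wants to reproduce the listings, here is a minimal, self-contained sketch of the function. The names GetForce, GetPos, Length, Normal, mp_other_verlet, m_mid and m_force come from the code above; the bodies of Vector2 and CVerletPoint are assumptions, since those classes are not shown (in particular, Normal() is assumed to return the unit-length vector, which is what the disassembly suggests). One small side note: the unsuffixed 0.000001 is a double literal, which is why the listings contain a single-to-double conversion (cvtps2pd, comisd) and a qword compare. The sketch uses 0.000001f to keep that comparison in single precision. This is unlikely to account for the /clr slowdown on its own, which looks dominated by stack traffic.</p>

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-ins: the original Vector2 and CVerletPoint are not shown,
// so everything inside these two types is an assumption for illustration.
struct Vector2 {
    float x, y;
    Vector2(float x_ = 0.0f, float y_ = 0.0f) : x(x_), y(y_) {}
    Vector2 operator-(const Vector2& o) const { return Vector2(x - o.x, y - o.y); }
    Vector2 operator+(const Vector2& o) const { return Vector2(x + o.x, y + o.y); }
    Vector2 operator*(float s) const { return Vector2(x * s, y * s); }
    float Length() const { return std::sqrt(x * x + y * y); }
    // Assumed behaviour: return the unit-length vector; a zero vector is returned as-is.
    Vector2 Normal() const {
        float len = Length();
        return len > 0.0f ? *this * (1.0f / len) : *this;
    }
};

struct CVerletPoint {
    Vector2 pos;
    Vector2 GetPos() const { return pos; }
};

struct CSemiRigidConstraint {
    CVerletPoint* mp_other_verlet;
    float m_mid;    // rest distance from the other point
    float m_force;  // scale applied to the correction vector

    Vector2 GetForce(CVerletPoint* p_verlet) const {
        Vector2 to_me = p_verlet->GetPos() - mp_other_verlet->GetPos();
        // Float literal: the comparison stays in single precision.
        // With the unsuffixed 0.000001, Length() is promoted to double first.
        if (to_me.Length() < 0.000001f) {
            to_me = Vector2(1.0f, 0.0f);
        }
        Vector2 mid = mp_other_verlet->GetPos() + to_me.Normal() * m_mid;
        Vector2 to_mid = mid - p_verlet->GetPos();
        return to_mid * m_force;
    }
};
```

<p>Building this once with 0.000001 and once with 0.000001f, then comparing the disassembly of GetForce, should show whether the conversion instructions go away with your compiler and settings.</p>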
]]></content:encoded>
	</item>
</channel>
</rss>
