<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Blob Physics</title>
	<atom:link href="http://cowboyprogramming.com/2007/01/05/blob-physics/feed/" rel="self" type="application/rss+xml" />
	<link>http://cowboyprogramming.com/2007/01/05/blob-physics/</link>
	<description>Game Development and General Hacking by the Old West</description>
	<lastBuildDate>Thu, 11 Mar 2010 16:10:33 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Omar</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/comment-page-1/#comment-133632</link>
		<dc:creator>Omar</dc:creator>
		<pubDate>Fri, 09 Oct 2009 12:09:09 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-133632</guid>
		<description>Just stumbled on this site; nice articles. I programmed similar physics for a game called Soul Bubbles on the DS. Lots of moving blobs and geometry collisions (a typical level has 15k points), which was fun to fit onto the DS.

You said:
&quot;This causes the spring system to have a net force imbalance in a particular direction (depending on the order of update). The solution here was very simple. I just split the loop up into two separate loops: one to gather the forces, and then one to apply them in the integration step. This ensured that all forces were symmetrical.&quot;

I generally solved this kind of perturbation by inverting the processing order on each update loop (e.g. on frame n update points from 0 to max, on frame n+1 update points from max to 0). It is just as cache-friendly, avoids iterating over the points more than once for this kind of operation, and generally tends to stabilize the system.

The bubbles in Soul Bubbles are 32 Verlet-integrated points to which the following are applied:
- A single layer of spring constraints (point N to point (N+1)%32).
- Angle constraints to try to maintain an overall round shape, applied by moving point N toward an ideal angle between N-1, N, N+1. I don&#039;t recall the details, but this proved to be different from applying springs between points N-1 and N+1.
- A slight scaling up of all points from their average center to cope with external pressure when the surface becomes too small.
- In cases where bubbles run into thin or sharp objects, collision responses are biased to push points back toward the blob center, and friction is increased for those points to reduce penetration.

Add in some thresholds to optimize idle cases and various common-sense low-level optimizations. In the end, other components (such as managing the membranes between two colliding bubbles) were heavier.</description>
		<content:encoded><![CDATA[<p>Just stumbled on this site; nice articles. I programmed similar physics for a game called Soul Bubbles on the DS. Lots of moving blobs and geometry collisions (a typical level has 15k points), which was fun to fit onto the DS.</p>
<p>You said:<br />
&#8220;This causes the spring system to have a net force imbalance in a particular direction (depending on the order of update). The solution here was very simple. I just split the loop up into two separate loops: one to gather the forces, and then one to apply them in the integration step. This ensured that all forces were symmetrical.&#8221;</p>
<p>I generally solved this kind of perturbation by inverting the processing order on each update loop (e.g. on frame n update points from 0 to max, on frame n+1 update points from max to 0). It is just as cache-friendly, avoids iterating over the points more than once for this kind of operation, and generally tends to stabilize the system.</p>
<p>The bubbles in Soul Bubbles are 32 Verlet-integrated points to which the following are applied:<br />
- A single layer of spring constraints (point N to point (N+1)%32).<br />
- Angle constraints to try to maintain an overall round shape, applied by moving point N toward an ideal angle between N-1, N, N+1. I don&#8217;t recall the details, but this proved to be different from applying springs between points N-1 and N+1.<br />
- A slight scaling up of all points from their average center to cope with external pressure when the surface becomes too small.<br />
- In cases where bubbles run into thin or sharp objects, collision responses are biased to push points back toward the blob center, and friction is increased for those points to reduce penetration.</p>
<p>Add in some thresholds to optimize idle cases and various common-sense low-level optimizations. In the end, other components (such as managing the membranes between two colliding bubbles) were heavier.</p>
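<p>The alternating update order described above could be sketched roughly as follows. This is only an illustration, not the Soul Bubbles code: the Point type, the UpdateAlternating name and the per-point callback are all assumptions made up for the example.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical point type; the real game data is not shown in the comment.
struct Point { float x, y; };

// Visits every point exactly once per frame: front-to-back on even frames,
// back-to-front on odd frames, so any order-dependent bias flips direction
// each frame instead of accumulating.
template <typename UpdateFn>
void UpdateAlternating(std::vector<Point>& points, unsigned frame, UpdateFn update)
{
    const std::size_t n = points.size();
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t idx = (frame % 2 == 0) ? i : n - 1 - i;
        assert(idx < n); // sanity check on the reversed index
        update(points[idx], idx);
    }
}
```

<p>The points are still walked linearly through memory in both directions, which is why this stays as cache-friendly as a single forward loop.</p>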
]]></content:encoded>
	</item>
	<item>
		<title>By: Mick West</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/comment-page-1/#comment-88673</link>
		<dc:creator>Mick West</dc:creator>
		<pubDate>Sun, 22 Feb 2009 15:35:27 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-88673</guid>
		<description>Yeah, it probably would work for car tires.   However it would also be rather a waste of computing power (and programming time), as the effects would be too slight to see unless you had some kind of &quot;tire cam&quot;.  :)</description>
		<content:encoded><![CDATA[<p>Yeah, it probably would work for car tires.   However it would also be rather a waste of computing power (and programming time), as the effects would be too slight to see unless you had some kind of &#8220;tire cam&#8221;.  :)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Chris</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/comment-page-1/#comment-88583</link>
		<dc:creator>Chris</dc:creator>
		<pubDate>Sun, 22 Feb 2009 09:02:55 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-88583</guid>
		<description>Another awesome article, well done.  I think this might be good for car tires too (Probably with stiffer springs).  They could simulate nice bending, bulging and you could have a really simple traction policy too, traction = GetNumberOfPointsContactingRoad() / nNormalNumberPointsContactingRoad.  Keep up the good work.</description>
		<content:encoded><![CDATA[<p>Another awesome article, well done.  I think this might be good for car tires too (Probably with stiffer springs).  They could simulate nice bending, bulging and you could have a really simple traction policy too, traction = GetNumberOfPointsContactingRoad() / nNormalNumberPointsContactingRoad.  Keep up the good work.</p>
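<p>A minimal sketch of that traction ratio in C++, assuming both counts are plain integers supplied by the caller. The ComputeTraction name and the clamp to 1.0 are assumptions added for the example; the explicit float conversion is there because dividing two ints would truncate every partial contact to zero.</p>

```cpp
#include <algorithm>
#include <cassert>

// Fraction of the tire's usual contact patch that is currently touching
// the road. Both counts are hypothetical inputs.
float ComputeTraction(int pointsContactingRoad, int normalPointsContactingRoad)
{
    assert(normalPointsContactingRoad > 0);
    const float ratio = static_cast<float>(pointsContactingRoad) /
                        static_cast<float>(normalPointsContactingRoad);
    return std::min(ratio, 1.0f); // clamping is an assumption, not from the comment
}
```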
]]></content:encoded>
	</item>
	<item>
		<title>By: pixelalo.com</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/comment-page-1/#comment-3668</link>
		<dc:creator>pixelalo.com</dc:creator>
		<pubDate>Mon, 10 Sep 2007 15:57:35 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-3668</guid>
		<description>&lt;strong&gt;Simulating 2D &quot;blobs&quot;...&lt;/strong&gt;

On Mick West&#039;s website there is an interesting tutorial on how to program blobs (or &quot;big viscous drops&quot;, for lack of a better synonym) like the ones that appear in games such as Roco Loco....</description>
		<content:encoded><![CDATA[<p><strong>Simulating 2D &quot;blobs&quot;&#8230;</strong></p>
<p>On Mick West&#8217;s website there is an interesting tutorial on how to program blobs (or &quot;big viscous drops&quot;, for lack of a better synonym) like the ones that appear in games such as Roco Loco&#8230;.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mick West</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/comment-page-1/#comment-44</link>
		<dc:creator>Mick West</dc:creator>
		<pubDate>Fri, 09 Feb 2007 00:07:44 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-44</guid>
		<description>&lt;blockquote&gt;Yes, I suppose if I really wanted to I could also sacrifice a huge amount of readability to hyperoptimize the managed version. Experience tells me that there are better ways to spend my time though. The 90/10 rule applies even here :)&lt;/blockquote&gt;

Better ways to spend 90% of your time, that&#039;s for sure.  But I just made my application 8% faster with 15 minutes&#039; work (it went from 123 to 134 fps, in non-SIMD, just with that one change!).  Remember what Knuth said about similar optimization:

&lt;blockquote&gt;The conventional wisdom shared by many of today’s software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by penny-wise-and-pound-foolish programmers, who can’t debug or maintain their ‘optimized’ programs&lt;/blockquote&gt;


Your point on &quot;good&quot; SIMD is well taken.  I&#039;m from console-land, and if you are doing anything intensive with vectors like this, then you want to be using SIMD (or preferably the vector units).

It&#039;s a period of transition.  Managed code is great, but there&#039;s still an appropriate division between game code and engine code.   It&#039;s not a hard and fast division, and depends on the game (and several other factors), but it&#039;s still there.   As managed code gets better at targeting hardware, then it will get used more.</description>
		<content:encoded><![CDATA[<blockquote><p>Yes, I suppose if I really wanted to I could also sacrifice a huge amount of readability to hyperoptimize the managed version. Experience tells me that there are better ways to spend my time though. The 90/10 rule applies even here :)</p></blockquote>
<p>Better ways to spend 90% of your time, that&#8217;s for sure.  But I just made my application 8% faster with 15 minutes&#8217; work (it went from 123 to 134 fps, in non-SIMD, just with that one change!).  Remember what Knuth said about similar optimization:</p>
<blockquote><p>The conventional wisdom shared by many of today’s software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by penny-wise-and-pound-foolish programmers, who can’t debug or maintain their ‘optimized’ programs</p></blockquote>
<p>Your point on &#8220;good&#8221; SIMD is well taken.  I&#8217;m from console-land, and if you are doing anything intensive with vectors like this, then you want to be using SIMD (or preferably the vector units).</p>
<p>It&#8217;s a period of transition.  Managed code is great, but there&#8217;s still an appropriate division between game code and engine code.   It&#8217;s not a hard and fast division, and depends on the game (and several other factors), but it&#8217;s still there.   As managed code gets better at targeting hardware, then it will get used more.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Washu</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/comment-page-1/#comment-43</link>
		<dc:creator>Washu</dc:creator>
		<pubDate>Thu, 08 Feb 2007 23:21:39 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-43</guid>
		<description>&lt;blockquote&gt;
And here’s the comparison.

http://cowboyprogramming.com/code/ComparisonOfManagedOptimizations2.html

And really I should be using the MD part of SIMD, which should be able to halve it again.

So my original example was somewhat misleading. However, when you look at actual engine code, it WILL be optimized to take advantage of the processor architecture, and I suspect based on that then the speed comparisons will be similar to what I was getting. 
&lt;/blockquote&gt;
Yes, I suppose if I really wanted to I could also sacrifice a huge amount of readability to hyperoptimize the managed version. Experience tells me that there are better ways to spend my time though. The 90/10 rule applies even here :)

What I do note is the lack of &#039;good&#039; SIMD in most applications, for instance the failure to parallelize operations. Instead of performing one dot product at a time using SIMD, do four, as below (note that I do not claim that you should use the code below, just that it&#039;s typically a better parallelization than single operations):
&lt;code&gt;
; Given that XMM0 - XMM7 contain R4 vectors V0 - V7, such that we wish to
; calculate the inner products of (V0, V1), (V2, V3), (V4, V5) and (V6, V7) and return the results of
; all four inner products in a result vector.
mulps xmm0, xmm1
mulps xmm2, xmm3
haddps xmm0, xmm2

mulps xmm4, xmm5
mulps xmm6, xmm7
haddps xmm4, xmm6

haddps xmm0, xmm4
movaps result, xmm0
&lt;/code&gt;</description>
		<content:encoded><![CDATA[<blockquote><p>
And here’s the comparison.</p>
<p><a href="http://cowboyprogramming.com/code/ComparisonOfManagedOptimizations2.html" rel="nofollow">http://cowboyprogramming.com/code/ComparisonOfManagedOptimizations2.html</a></p>
<p>And really I should be using the MD part of SIMD, which should be able to halve it again.</p>
<p>So my original example was somewhat misleading. However, when you look at actual engine code, it WILL be optimized to take advantage of the processor architecture, and I suspect based on that then the speed comparisons will be similar to what I was getting.
</p></blockquote>
<p>Yes, I suppose if I really wanted to I could also sacrifice a huge amount of readability to hyperoptimize the managed version. Experience tells me that there are better ways to spend my time though. The 90/10 rule applies even here :)</p>
<p>What I do note is the lack of &#8216;good&#8217; SIMD in most applications, for instance the failure to parallelize operations. Instead of performing one dot product at a time using SIMD, do four, as below (note that I do not claim that you should use the code below, just that it&#8217;s typically a better parallelization than single operations):<br />
<code><br />
; Given that XMM0 - XMM7 contain R4 vectors V0 - V7, such that we wish to<br />
; calculate the inner products of (V0, V1), (V2, V3), (V4, V5) and (V6, V7) and return the results of<br />
; all four inner products in a result vector.<br />
mulps xmm0, xmm1<br />
mulps xmm2, xmm3<br />
haddps xmm0, xmm2</p>
<p>mulps xmm4, xmm5<br />
mulps xmm6, xmm7<br />
haddps xmm4, xmm6</p>
<p>haddps xmm0, xmm4<br />
movaps result, xmm0<br />
</code></p>
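<p>To make the lane bookkeeping in the listing above easier to follow, here is a rough scalar sketch of the same data flow in C++. It emulates what mulps and haddps do to each lane rather than issuing real SSE instructions, so it shows where each result lands but says nothing about speed. Vec4, Mul, HAdd and Dot4 are names invented for this illustration.</p>

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<float, 4>;

// Lane-wise multiply, the scalar equivalent of mulps.
Vec4 Mul(const Vec4& a, const Vec4& b)
{
    return {a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]};
}

// Horizontal add, the scalar equivalent of "haddps dst, src":
// adjacent pairs of dst fill the low two lanes, pairs of src the high two.
Vec4 HAdd(const Vec4& dst, const Vec4& src)
{
    return {dst[0] + dst[1], dst[2] + dst[3], src[0] + src[1], src[2] + src[3]};
}

// Four dot products at once: v[0].v[1], v[2].v[3], v[4].v[5], v[6].v[7].
Vec4 Dot4(const std::array<Vec4, 8>& v)
{
    const Vec4 lo = HAdd(Mul(v[0], v[1]), Mul(v[2], v[3])); // partial sums, pairs 0 and 1
    const Vec4 hi = HAdd(Mul(v[4], v[5]), Mul(v[6], v[7])); // partial sums, pairs 2 and 3
    return HAdd(lo, hi); // one finished dot product per lane
}
```

<p>After the last horizontal add, lane i of the result holds the inner product of the i-th pair, which is why three haddps instructions are enough for four dot products.</p>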
]]></content:encoded>
	</item>
	<item>
		<title>By: Mick West</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/comment-page-1/#comment-42</link>
		<dc:creator>Mick West</dc:creator>
		<pubDate>Thu, 08 Feb 2007 23:03:46 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-42</guid>
		<description>I wrote an &quot;old fashioned&quot; version for kicks:

&lt;code&gt;
Vector2	CSemiRigidConstraint::GetForce(CVerletPoint* p_verlet)
{	float v_x = p_verlet-&gt;GetPos().x;
	float v_y = p_verlet-&gt;GetPos().y;
	float o_x = mp_other_verlet-&gt;GetPos().x;
	float o_y = mp_other_verlet-&gt;GetPos().y;
	float t_x = (v_x - o_x);
	float t_y = (v_y - o_y);
	float t_len = sqrtf(t_x*t_x + t_y*t_y);
	if (t_len &lt;0.000001) // should be f, but leave it for comparison
	{
		t_x = 1.0f;
		t_y = 0.0f;
		t_len = 1.0f;
	}
	float mid = m_mid;
	float m_x = o_x + t_x/t_len*mid;
	float m_y = o_y + t_y/t_len*mid;
	float tm_x = m_x - v_x;
	float tm_y = m_y - v_y;
	float force = m_force;
	return Vector2(tm_x*force,tm_y*force);
}
&lt;/code&gt;

And here&#039;s the comparison.

http://cowboyprogramming.com/code/ComparisonOfManagedOptimizations2.html

And really I should be using the MD part of SIMD, which should be able to halve it again.

So my original example was somewhat misleading.  However, when you look at actual engine code, it WILL be optimized to take advantage of the processor architecture, and I suspect based on that then the speed comparisons will be similar to what I was getting.</description>
		<content:encoded><![CDATA[<p>I wrote an &#8220;old fashioned&#8221; version for kicks:</p>
<p><code><br />
Vector2	CSemiRigidConstraint::GetForce(CVerletPoint* p_verlet)<br />
{	float v_x = p_verlet->GetPos().x;<br />
	float v_y = p_verlet->GetPos().y;<br />
	float o_x = mp_other_verlet->GetPos().x;<br />
	float o_y = mp_other_verlet->GetPos().y;<br />
	float t_x = (v_x - o_x);<br />
	float t_y = (v_y - o_y);<br />
	float t_len = sqrtf(t_x*t_x + t_y*t_y);<br />
	if (t_len &lt;0.000001) // should be f, but leave it for comparison<br />
	{<br />
		t_x = 1.0f;<br />
		t_y = 0.0f;<br />
		t_len = 1.0f;<br />
	}<br />
	float mid = m_mid;<br />
	float m_x = o_x + t_x/t_len*mid;<br />
	float m_y = o_y + t_y/t_len*mid;<br />
	float tm_x = m_x - v_x;<br />
	float tm_y = m_y - v_y;<br />
	float force = m_force;<br />
	return Vector2(tm_x*force,tm_y*force);<br />
}<br />
</code></p>
<p>And here's the comparison.</p>
<p><a href="http://cowboyprogramming.com/code/ComparisonOfManagedOptimizations2.html" rel="nofollow">http://cowboyprogramming.com/code/ComparisonOfManagedOptimizations2.html</a></p>
<p>And really I should be using the MD part of SIMD, which should be able to halve it again.</p>
<p>So my original example was somewhat misleading.  However, when you look at actual engine code, it WILL be optimized to take advantage of the processor architecture, and I suspect based on that then the speed comparisons will be similar to what I was getting.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Washu</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/comment-page-1/#comment-40</link>
		<dc:creator>Washu</dc:creator>
		<pubDate>Thu, 08 Feb 2007 22:41:20 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-40</guid>
		<description>&lt;blockquote&gt;
And now I’ve seen your whole code (nice comparison BTW), it’s still 10% longer than the non-SIMD version, plus it calls functions for Magnitude, adding and subtracting. Whereas the C++ version just calls sqrtf, etc. I don’t know what’s in those functions, but I highly suspect at least another ten lines of asm per function, possibly more.&lt;/blockquote&gt;
Indeed, there are ways to get around that too, ways that will generate better/faster code, ways that a professional application would use, such as ngen. Using that to pre-JIT the program on the target machine during installation will allow it more time to optimize. How much better it will be, I can&#039;t be sure.

&lt;blockquote&gt;But how fast does it run?&lt;/blockquote&gt;
Faster. I don&#039;t have the entire sample done, and I&#039;m probably not going to complete it either (I have other commitments), but profiling suggests that it is about 5-10% slower than yours (without SIMD).

Which isn&#039;t bad. It is not great, but overall that’s a significant performance boost from compiling with just /clr. You still won’t be able to write an HL2 killer in it, at least not yet. What is interesting are the future JIT compilers that Microsoft is investing in: JIT compilers that can optimize based on the machine configuration, something that current statically built applications cannot do. As an example, if you want to use SIMD, you typically have to build in several code paths: ones that can use SIMD, and ones that can’t. You then decide (at runtime) which path to take depending on the available features. The disadvantage here is a single level of indirection, and a whole hell of a lot of code on the developer&#039;s part. A well-written inner product can easily outperform the SSE generated by VSTS (Visual Studio Team System). Compilers just aren’t good at vectorization, even ones like the Intel compiler (which produces bugs when used, as translation of code to a vectorized format inherently changes the behavior of the application in unpredictable ways).

A more advanced JIT will be able to target the processor that the machine is running on, including hardware extensions, when the application is launched. This presents an opportunity for extreme runtime optimizations based on extended instruction sets. The JIT will still be constrained to a shorter running time than a static compiler, but using tricks like NGEN, you will be able to really optimize it to a great extent.</description>
		<content:encoded><![CDATA[<blockquote><p>
And now I’ve seen your whole code (nice comparison BTW), it’s still 10% longer than the non-SIMD version, plus it calls functions for Magnitude, adding and subtracting. Whereas the C++ version just calls sqrtf, etc. I don’t know what’s in those functions, but I highly suspect at least another ten lines of asm per function, possibly more.</p></blockquote>
<p>Indeed, there are ways to get around that too, ways that will generate better/faster code, ways that a professional application would use, such as ngen. Using that to pre-JIT the program on the target machine during installation will allow it more time to optimize. How much better it will be, I can&#8217;t be sure.</p>
<blockquote><p>But how fast does it run?</p></blockquote>
<p>Faster. I don&#8217;t have the entire sample done, and I&#8217;m probably not going to complete it either (I have other commitments), but profiling suggests that it is about 5-10% slower than yours (without SIMD).</p>
<p>Which isn&#8217;t bad. It is not great, but overall that’s a significant performance boost from compiling with just /clr. You still won’t be able to write an HL2 killer in it, at least not yet. What is interesting are the future JIT compilers that Microsoft is investing in: JIT compilers that can optimize based on the machine configuration, something that current statically built applications cannot do. As an example, if you want to use SIMD, you typically have to build in several code paths: ones that can use SIMD, and ones that can’t. You then decide (at runtime) which path to take depending on the available features. The disadvantage here is a single level of indirection, and a whole hell of a lot of code on the developer&#8217;s part. A well-written inner product can easily outperform the SSE generated by VSTS (Visual Studio Team System). Compilers just aren’t good at vectorization, even ones like the Intel compiler (which produces bugs when used, as translation of code to a vectorized format inherently changes the behavior of the application in unpredictable ways).</p>
<p>A more advanced JIT will be able to target the processor that the machine is running on, including hardware extensions, when the application is launched. This presents an opportunity for extreme runtime optimizations based on extended instruction sets. The JIT will still be constrained to a shorter running time than a static compiler, but using tricks like NGEN, you will be able to really optimize it to a great extent.</p>
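<p>For comparison, the hand-written "several code paths" approach mentioned above might look roughly like this in C++. It is a sketch under stated assumptions: cpuHasSimd stands in for a real CPU feature query (e.g. via CPUID), and DotWide is only a portable stand-in for a vectorized routine, not actual SSE code. DotFn, DotScalar, DotWide and SelectDot are names invented for the example.</p>

```cpp
#include <cassert>
#include <cstddef>

using DotFn = float (*)(const float* a, const float* b, std::size_t n);

// Plain scalar path that runs on any CPU.
float DotScalar(const float* a, const float* b, std::size_t n)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

// Stand-in for a hand-vectorized path. It is unrolled by four only to keep
// the sketch portable; a real build would use SSE intrinsics or assembly.
float DotWide(const float* a, const float* b, std::size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i) s0 += a[i] * b[i]; // leftover elements
    return (s0 + s1) + (s2 + s3);
}

// Chosen once at startup; every later call pays one indirect call,
// which is the "single level of indirection" cost.
DotFn SelectDot(bool cpuHasSimd)
{
    return cpuHasSimd ? &DotWide : &DotScalar;
}
```

<p>Both paths have to be written, tested and shipped by the developer, which is the maintenance burden a hardware-aware JIT would take over.</p>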
]]></content:encoded>
	</item>
	<item>
		<title>By: Washu</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/comment-page-1/#comment-39</link>
		<dc:creator>Washu</dc:creator>
		<pubDate>Thu, 08 Feb 2007 22:27:16 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-39</guid>
		<description>Actually, it&#039;s not so much about hacking it to make the assembly nice. It&#039;s a difference in how various types are treated.

Value types in .NET are a copy-based mechanism, much like they are in other languages. Anything declared as a struct in C# is considered to be a value type. Making it a class will not solve the problem though (actually, it makes it worse, but I&#039;ll detail why in a bit). Since value types are primarily a copy mechanism (the old immutable idea), you&#039;ve got to account for that in your code. Things like taking a reference to the value type (passing by ref or as an out parameter) enable the CLR to realize how you are attempting to use these various objects, enabling it to eliminate redundant copies and the like. The JIT has very little time to run in, so it&#039;s not always going to make the best decisions (it does make a lot of smart ones though).

Classes don&#039;t solve the problem because they are heap-only. You cannot allocate a non-heap-based class, so the allocation of a class type will eventually lead to a GC. Now, gen0 collections are INSANELY fast, but if you&#039;re pumping out a lot of short-lived temporaries, some objects will live a bit longer and get promoted to gen1. Then when gen1 gets full, it will also be collected, and those short-lived temps will be released, but any that live just a bit longer could end up being promoted to gen2. Gen2 takes a long time to collect, especially with the LOH (large object heap) being up there in gen2 as well. So obviously short-lived temporaries can cost you...if you&#039;re not careful.

But it&#039;s not all horror stories. A gen0 collection is INSANELY fast, as I mentioned above; a wee bit of profiling on my part found it to be faster than a C++ allocation on a moderately fragmented heap. Since most C++ allocators use a heap walk to find a free chunk of memory to allocate (this is unspecified in the standard, so not all implementations have to behave this way), that traversal can become quite expensive. With managed languages, however, an allocation is a constant-time operation. The GC will typically do a sweep and compact when a gen2 or gen1 collection happens, but in general, gen0 collections rarely have such actions. Plus, the GC typically doesn&#039;t have to freeze your application[1].

You have to watch out for finalizers though: since destruction is not deterministic, finalizers can cause short-lived objects to be promoted to gen1 when they are really just waiting to be finalized[2].

[1] &lt;a href=&quot;http://blogs.msdn.com/maoni/archive/2004/06/15/156626.aspx&quot; rel=&quot;nofollow&quot;&gt;Using GC Efficiently – Part 1&lt;/a&gt;
    &lt;a href=&quot;http://blogs.msdn.com/maoni/archive/2004/09/25/234273.aspx&quot; rel=&quot;nofollow&quot;&gt;Using GC Efficiently – Part 2&lt;/a&gt;
    &lt;a href=&quot;http://blogs.msdn.com/maoni/archive/2004/12/19/327149.aspx&quot; rel=&quot;nofollow&quot;&gt;Using GC Efficiently – Part 3&lt;/a&gt;

[2] &lt;a href=&quot;http://blogs.msdn.com/maoni/archive/2004/11/04/252697.aspx&quot; rel=&quot;nofollow&quot;&gt;Clearing up some confusion over finalization and other areas in GC&lt;/a&gt;</description>
		<content:encoded><![CDATA[<p>Actually, it&#8217;s not so much about hacking it to make the assembly nice. It&#8217;s a difference in how various types are treated.</p>
<p>Value types in .NET are a copy-based mechanism, much like they are in other languages. Anything declared as a struct in C# is considered to be a value type. Making it a class will not solve the problem though (actually, it makes it worse, but I&#8217;ll detail why in a bit). Since value types are primarily a copy mechanism (the old immutable idea), you&#8217;ve got to account for that in your code. Things like taking a reference to the value type (passing by ref or as an out parameter) enable the CLR to realize how you are attempting to use these various objects, enabling it to eliminate redundant copies and the like. The JIT has very little time to run in, so it&#8217;s not always going to make the best decisions (it does make a lot of smart ones though).</p>
<p>Classes don&#8217;t solve the problem because they are heap-only. You cannot allocate a non-heap-based class, so the allocation of a class type will eventually lead to a GC. Now, gen0 collections are INSANELY fast, but if you&#8217;re pumping out a lot of short-lived temporaries, some objects will live a bit longer and get promoted to gen1. Then when gen1 gets full, it will also be collected, and those short-lived temps will be released, but any that live just a bit longer could end up being promoted to gen2. Gen2 takes a long time to collect, especially with the LOH (large object heap) being up there in gen2 as well. So obviously short-lived temporaries can cost you&#8230;if you&#8217;re not careful.</p>
<p>But it&#8217;s not all horror stories. A gen0 collection is INSANELY fast, as I mentioned above; a wee bit of profiling on my part found it to be faster than a C++ allocation on a moderately fragmented heap. Since most C++ allocators use a heap walk to find a free chunk of memory to allocate (this is unspecified in the standard, so not all implementations have to behave this way), that traversal can become quite expensive. With managed languages, however, an allocation is a constant-time operation. The GC will typically do a sweep and compact when a gen2 or gen1 collection happens, but in general, gen0 collections rarely have such actions. Plus, the GC typically doesn&#8217;t have to freeze your application[1].</p>
<p>You have to watch out for finalizers though: since destruction is not deterministic, finalizers can cause short-lived objects to be promoted to gen1 when they are really just waiting to be finalized[2].</p>
<p>[1] <a href="http://blogs.msdn.com/maoni/archive/2004/06/15/156626.aspx" rel="nofollow">Using GC Efficiently – Part 1</a><br />
    <a href="http://blogs.msdn.com/maoni/archive/2004/09/25/234273.aspx" rel="nofollow">Using GC Efficiently – Part 2</a><br />
    <a href="http://blogs.msdn.com/maoni/archive/2004/12/19/327149.aspx" rel="nofollow">Using GC Efficiently – Part 3</a></p>
<p>[2] <a href="http://blogs.msdn.com/maoni/archive/2004/11/04/252697.aspx" rel="nofollow">Clearing up some confusion over finalization and other areas in GC</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mick West</title>
		<link>http://cowboyprogramming.com/2007/01/05/blob-physics/comment-page-1/#comment-38</link>
		<dc:creator>Mick West</dc:creator>
		<pubDate>Thu, 08 Feb 2007 22:21:53 +0000</pubDate>
		<guid isPermaLink="false">http://cowboyprogramming.com/?p=35#comment-38</guid>
		<description>And now I&#039;ve seen your whole code (nice comparison BTW), it&#039;s still 10% longer than the non-SIMD version, plus it calls functions for Magnitude, adding and subtracting.  Whereas the C++ version just calls sqrtf, etc.  I don&#039;t know what&#039;s in those functions, but I highly suspect at least another ten lines of asm per function, possibly more.

But how fast does it run?</description>
		<content:encoded><![CDATA[<p>And now I&#8217;ve seen your whole code (nice comparison BTW), it&#8217;s still 10% longer than the non-SIMD version, plus it calls functions for Magnitude, adding and subtracting.  Whereas the C++ version just calls sqrtf, etc.  I don&#8217;t know what&#8217;s in those functions, but I highly suspect at least another ten lines of asm per function, possibly more.</p>
<p>But how fast does it run?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
