<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-6042417775578107106</id><updated>2012-01-28T05:49:20.834-05:00</updated><category term='GIS'/><category term='Visual Studio'/><category term='Wordpress'/><category term='Performance'/><category term='SQL'/><category term='Email'/><category term='SSE'/><category term='C'/><category term='WinDBG'/><category term='Installers'/><category term='X-Plane Scenery Tools'/><category term='Windows'/><category term='algorithms'/><category term='Objective C'/><category term='NVidia'/><category term='OS X'/><category term='iphone'/><category term='X-Plane'/><category term='X-Code'/><category term='OpenAL'/><category term='Networking'/><category term='Debugging'/><category term='Humor'/><category term='Android'/><category term='Services'/><category term='c++'/><category term='Computational Geometry'/><category term='OpenGL'/><category term='Heap'/><category term='Threading'/><category term='Macintosh'/><category term='Quotes'/><category term='iis'/><category term='CSS'/><category term='MediaWiki'/><category term='Software Development'/><category term='CVS Voodoo'/><category term='CVS'/><category term='Design'/><category term='XML'/><category term='Modeling'/><category term='COM'/><category term='Tips'/><category term='Memory Management'/><category term='Game Development'/><category term='Unicode'/><category term='CGAL'/><category term='Rants'/><category term='GDB'/><category term='STL'/><category term='Linux'/><category term='Bonjour'/><category term='mod_rewrite'/><category term='GLSL'/><category term='ZeroConf'/><category term='OSM'/><title type='text'>The Hacks of Life</title><subtitle 
type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default?start-index=101&amp;max-results=100'/><author><name>Chris</name><uri>http://www.blogger.com/profile/14648675681957285299</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='26' src='http://www.cjserio.com/blogger/uploaded_images/Chris.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>266</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-4627904813518020998</id><published>2011-12-13T15:40:00.000-05:00</published><updated>2011-12-13T15:40:00.310-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>Stencil Optimization for Deferred Lights Without Depth Clamp</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;Using two sided stencil volumes to improve fill rate with deferred lights is not new; I'll write more if anyone wants, but this is all stuff I got off of the interwebs.&amp;nbsp; The high level summary:&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;To save fill rate when drawing deferred lights, we want to draw a geometric shape to the screen that covers as few pixels as possible - preferably only the ones that will be lit.&lt;/li&gt;&lt;li&gt;Typically this is done using 
either a billboard or a bounding volume around the light.&amp;nbsp; X-Plane 10 uses this second option, using a cube for omnidirectional lights and a quad pyramid for directional lights. (This is a trade-off of bounding volume accuracy for vertex count.)&lt;/li&gt;&lt;li&gt;If we have manifold bounding volumes, we can select only the fragments inside the volumes using a standard two-sided stenciling trick: we set the back face stencil mode to increment on depth fail and the front face stencil mode to decrement on depth fail - both with wrapping.&amp;nbsp; The result is that only screen-space pixels that contain geometry inside the volume (causing a depth fail on the back face but not the front face) end up with a non-zero stencil count.&lt;/li&gt;&lt;li&gt;Once we have our stencil buffer, we can simply render our manifold volumes with the stencil test enabled to discard fragments when our more expensive lighting shader is bound.&lt;/li&gt;&lt;/ul&gt;So far, all standard.&amp;nbsp; Now what happens when the near and far clip planes interfere with our bounding volume?&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;If the front of the volume intersects the near clip plane, that's no problem - the front facing geometry isn't drawn, but since there was no geometry in front of our light volume (how could there be - it would be on the wrong side of the near clip plane too) this is okay.&lt;/li&gt;&lt;li&gt;We need to render only the back face of our volume to correctly rasterize the entire light.&amp;nbsp; If we rasterize the front, we'll draw nothing when the camera is inside the light volume, which is bad.&amp;nbsp; (This need to handle being inside the shadow volume gracefully is why &lt;a href="http://en.wikipedia.org/wiki/Shadow_volume#Depth_fail"&gt;Carmack's Reverse&lt;/a&gt; is useful.)&lt;/li&gt;&lt;/ul&gt;If the back of the volume intersects the far clip plane, we have a bunch of problems though.&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;When 
drawing the actual light volume, we're going to lose a bunch of our screen-space coverage, and the light will be missing.&lt;/li&gt;&lt;li&gt;When we're using stenciling, the increment/decrement pattern will be broken.&amp;nbsp; If we have geometry in front of the entire light, it will end up off-by-one in its surface count.&amp;nbsp; This in turn can interfere with other lights that cover the same screen space.&lt;/li&gt;&lt;/ul&gt;This last case shows up as a really weird looking bug in X-Plane: when the landing light is on and pokes out the far clip plane, we can get a cut-out that &lt;i&gt;removes&lt;/i&gt; other area lights that cover the screen-space intersection of the landing light volume and the far clip plane.&lt;br /&gt;&lt;br /&gt;The simple solution is obvious: use &lt;a href="http://www.opengl.org/registry/specs/ARB/depth_clamp.txt"&gt;GL_ARB_depth_clamp&lt;/a&gt; to clamp depth to the near and far clip planes instead of clipping.&amp;nbsp; But what if you don't &lt;i&gt;have&lt;/i&gt; this extension?*&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-JPPsZZlAIJc/Tua9NB1czgI/AAAAAAAAA3A/Allr7orNGs8/s1600/clipped_1.png" imageanchor="1" style="float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;/a&gt;&lt;a href="http://1.bp.blogspot.com/-Dx6O15ASNpQ/Tua9MsAqhII/AAAAAAAAA24/qSWoKsN_wSQ/s1600/clipped_2.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"&gt;&lt;img border="0" height="150" src="http://1.bp.blogspot.com/-Dx6O15ASNpQ/Tua9MsAqhII/AAAAAAAAA24/qSWoKsN_wSQ/s200/clipped_2.png" width="200" /&gt;&lt;/a&gt;&lt;img border="0" height="150" src="http://1.bp.blogspot.com/-JPPsZZlAIJc/Tua9NB1czgI/AAAAAAAAA3A/Allr7orNGs8/s200/clipped_1.png" width="200" /&gt;&lt;a 
href="http://2.bp.blogspot.com/-Ag04dMrOVEE/Tua9MXTkihI/AAAAAAAAA2w/WHvAWgC1rKs/s1600/clipped_far.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="150" src="http://2.bp.blogspot.com/-Ag04dMrOVEE/Tua9MXTkihI/AAAAAAAAA2w/WHvAWgC1rKs/s200/clipped_far.png" width="200" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;In these pictures, we are seeing only a close rendering of the P180 - the far clip plane is just off the end of the airplane.&amp;nbsp; The red cone extending from the tail is the pyramid light volume for the tail light that is shining out from the tail - it illuminates the top of the plane.&lt;br /&gt;&lt;br /&gt;In the three pictures the far clip plane is progressively moved farther away.&amp;nbsp; The lighter colored square is the &lt;i&gt;missing&lt;/i&gt; geometry - since the pyramid is clipped, you're seeing only the top and sides of the pyramid but not the base.&amp;nbsp; This is the area that will not be correctly stencil counted or rasterized.&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-FT49azAFczM/Tua9LpJybYI/AAAAAAAAA2g/4VWbYi5f0fc/s1600/half_lit.png" imageanchor="1" style="float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="150" src="http://2.bp.blogspot.com/-FT49azAFczM/Tua9LpJybYI/AAAAAAAAA2g/4VWbYi5f0fc/s200/half_lit.png" width="200" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/-T7sbOAlocP4/Tua9MGNirpI/AAAAAAAAA2o/ishAn4fG7SE/s1600/missing_back.png" imageanchor="1" style="float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="150" src="http://3.bp.blogspot.com/-T7sbOAlocP4/Tua9MGNirpI/AAAAAAAAA2o/ishAn4fG7SE/s200/missing_back.png" width="200" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Here we can see why this is a 
problem.&amp;nbsp; Note the vertical line where the back face is missing.&amp;nbsp; When we actually rasterize, we don't get any light spill - the result is a vertical clip in our light, visible on the top of the fuselage.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;If depth clamp isn't available, one alternative is to restrict the Z position of each bounding volume vertex in clip space.&amp;nbsp; This can be done in the vertex shader with something like:&lt;br /&gt;&lt;blockquote class="tr_bq"&gt;gl_Position.z = clamp(gl_Position.z, gl_Position.w, -gl_Position.w);&lt;br /&gt;&lt;/blockquote&gt;(W tends to be negative for standard glFrustum matrices.)&lt;br /&gt;&lt;br /&gt;What's nice about this hack is that it is entirely in the vertex shader, which means that we don't do anything that could inhibit the GPU's ability to do early or optimized Z culling.&lt;br /&gt;&lt;br /&gt;The actual screen-space position of the view volume &lt;i&gt;does not change&lt;/i&gt;.&amp;nbsp; This is because the position edit is done in clip space, and clip space is orthographic - X and Y turn into raster positions and Z into a depth position.&amp;nbsp; The "perspective" is created by dividing X and Y by W - we're free to completely whack Z without deforming the geometry as long as we are post-frustum-transform.&lt;br /&gt;&lt;br /&gt;Well, not completely free.&amp;nbsp; There is one hitch: the actual Z test is no longer correct.&amp;nbsp; Observe these two pictures:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-v9xlKnnq1rU/Tua9KwSrzaI/AAAAAAAAA2I/UZt2CXR6z2c/s1600/good_z.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="150" src="http://1.bp.blogspot.com/-v9xlKnnq1rU/Tua9KwSrzaI/AAAAAAAAA2I/UZt2CXR6z2c/s200/good_z.png" width="200" /&gt;&lt;/a&gt;&lt;/div&gt;&amp;nbsp;In the first picture, we see the correct Z 
intersection of the view volume with the fuselage.&amp;nbsp; (This picture is normal rendering with a close far clip plane, hence the lack of a pyramid back.)&amp;nbsp; The area of the fuselage that is not red is outside the light bounding volume, and there is therefore no need to shade it.&lt;br /&gt;&lt;br /&gt;Now look at the second picture - this is with Z clamping in the vertex shader.&amp;nbsp; Because the Z position has been clamped pre-interpolation, the fragment Z positions of any face that partly extends outside the clip planes will be &lt;b&gt;wrong&lt;/b&gt;!&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-qHbcOSjy43U/Tua9KaCGf9I/AAAAAAAAA2A/uSMWnIlLhTc/s1600/bad_z.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="150" src="http://4.bp.blogspot.com/-qHbcOSjy43U/Tua9KaCGf9I/AAAAAAAAA2A/uSMWnIlLhTc/s200/bad_z.png" width="200" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;In the picture we see this in the form of incorrect volume intersection.&amp;nbsp; Because the far end of the pyramid has been moved closer to us (to keep it inside the far clip plane) the fragments of the entire pyramid are too close to us - almost like a poor-man's polygon offset. 
The result is that more of the fuselage has turned red - that is, the Z test is wrong.&amp;nbsp; The actual Z error will sometimes reject pixels and sometimes accept pixels, depending on the precise interaction of the view volume and the clip planes.&lt;br /&gt;&lt;br /&gt;The net result is this: we can hack the Z coordinate in the vertex shader to guarantee complete one-sided rasterization of our view volume even with tight clip planes and no depth clamp, but we cannot combine this hack with a stencil test because the stencil test uses depth fail and our depth results are wrong.&lt;br /&gt;&lt;br /&gt;Thus the production path for X-Plane is this:&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;In the "big" world we use two-sided stenciling.&lt;/li&gt;&lt;li&gt;In the "small" world if we have depth clamp we use two-sided stenciling and depth clamp.&lt;/li&gt;&lt;li&gt;In the "small" world if we don't have depth clamp we use vertex-shader clamping and skip stenciling.&lt;/li&gt;&lt;/ul&gt;* This is actually a real question for X-Plane 10 running on OS X 10.6.8; the ATI drivers don't support the extension and in X-Plane we don't want to push out the far clip plane for the in-cockpit render.&amp;nbsp; Is there any other way?&amp;nbsp; The truth is, the motivation to keep the far clip plane close is mostly a software-organizational one - the sim could run with a farther far clip plane but a lot of code uses the real view frustum and would have to be special-cased to maintain efficiency.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-4627904813518020998?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/4627904813518020998/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://hacksoflife.blogspot.com/2011/12/stencil-optimization-for-deferred.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/4627904813518020998'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/4627904813518020998'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/12/stencil-optimization-for-deferred.html' title='Stencil Optimization for Deferred Lights Without Depth Clamp'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-Dx6O15ASNpQ/Tua9MsAqhII/AAAAAAAAA24/qSWoKsN_wSQ/s72-c/clipped_2.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-1235395280447638425</id><published>2011-09-03T14:10:00.000-04:00</published><updated>2011-09-03T14:10:56.500-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Computational Geometry'/><category scheme='http://www.blogger.com/atom/ns#' term='X-Plane Scenery Tools'/><title type='text'>Bezier Curve Optimization</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;I've been meaning to write up a few notes about how we turn &lt;a href="http://www.openstreetmap.org/"&gt;OSM&lt;/a&gt; vector data into bezier curves for X-Plane.&amp;nbsp; The code is all open source - look in the scenery tools &lt;a href="http://dev.x-plane.com/cgit/"&gt;web code browser&lt;/a&gt; at &lt;a href="http://dev.x-plane.com/cgit/cgit.cgi/xptools.git/tree/src/XESCore/NetPlacement.cpp%20"&gt;NetPlacement.cpp&lt;/a&gt; and 
&lt;a href="http://dev.x-plane.com/cgit/cgit.cgi/xptools.git/tree/src/XESCore/BezierApprox.cpp"&gt;BezierApprox.cpp&lt;/a&gt; for the code.&lt;br /&gt;&lt;br /&gt;First, a few basics:&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;OSM vector data comes as polyline data - that is, each road is a series of points connected by straight line segments.&amp;nbsp; There are a &lt;i&gt;lot&lt;/i&gt; of points - sometimes every few meters along a road.&lt;/li&gt;&lt;li&gt;X-Plane 10 uses piece-wise bezier curves, meaning a road is a string of end-to-end bezier curves.&amp;nbsp; Each curve can be a line segment, quadratic, or cubic bezier curve, but not anything of higher degree.&lt;/li&gt;&lt;li&gt;The representation in X-Plane for the piece-wise bezier curves is a list of "tagged" points, where the tag defines whether a point is a piece-wise curve end-point or control point.&amp;nbsp; Semantically, the two end points must not be control points (must not be tagged) and we can never have more than two consecutive control points (because that would define a higher-order bezier).&lt;/li&gt;&lt;li&gt;There is no requirement that the curve be smooth - we can create a sharp corner at any non-control point, even between two bezier curves.&lt;/li&gt;&lt;/ul&gt;Both OSM and X-Plane's road system have network topology, but for the purpose of converting a polyline to a piece-wise bezier curve we care only about a single road between junctions - that is, a "curve" in topology-terms.&lt;br /&gt;&lt;br /&gt;We divide the problem into two parts: converting the polyline to a piece-wise bezier curve and optimizing that bezier curve to reduce point count.&lt;br /&gt;&lt;br /&gt;To build the initial beziers we take an idea from ATI's &lt;a 
href="http://www.google.com/url?sa=t&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CBYQFjAA&amp;amp;url=http%3A%2F%2Falex.vlachos.com%2Fgraphics%2FCurvedPNTriangles.pdf&amp;amp;rct=j&amp;amp;q=ATI%20PN-triangles&amp;amp;ei=wlJiTq3rH8PLgQetn-2JCg&amp;amp;usg=AFQjCNHxphyL1VbOOweccvXQp-iWecI9MA&amp;amp;cad=rja"&gt;PN-Triangles&lt;/a&gt; paper. The basic idea from the paper is this: if we have a poly-line (or triangle mesh) approximation of a curved surface, we can estimate the tangents at the vertices by averaging the direction of all incident linear components.&amp;nbsp; With the tangents at the vertices, we can then construct a bezier surface through that tangent (because a bezier curve's tangent at its end point runs toward the next control point) and use that to "round out" the mesh.&lt;br /&gt;&lt;br /&gt;The idea is actually a lot easier to understand for polylines, where the dimensionality is lower.&amp;nbsp; For each vertex in our road, we'll find the "average" direction of the road (the tangent line) at that vertex based on the two segments coming into that vertex.&amp;nbsp; The bezier control points adjacent to the vertex must run along that tangent line; we can then adjust the distance of the control points from the end point to control the amount of "bulge".&lt;br /&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-noh4LnJOKjg/TmJcOLmW_KI/AAAAAAAAA1I/qWHc2mTR3hU/s1600/1_orig_poly_line.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-noh4LnJOKjg/TmJcOLmW_KI/AAAAAAAAA1I/qWHc2mTR3hU/s1600/1_orig_poly_line.png" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;We start with a poly-line.&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-vavmtnzijFI/TmJcOU64_YI/AAAAAAAAA1M/g95QClS_4Ag/s1600/2_tangent_vectors.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-vavmtnzijFI/TmJcOU64_YI/AAAAAAAAA1M/g95QClS_4Ag/s1600/2_tangent_vectors.png" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;We calculate the tangents at each vertex.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-996oL2gLjgo/TmJcOmjJAaI/AAAAAAAAA1Q/PU0Z_tRV-2M/s1600/3_control_ts.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-996oL2gLjgo/TmJcOmjJAaI/AAAAAAAAA1Q/PU0Z_tRV-2M/s1600/3_control_ts.png" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;We place bezier control points along the tangents at fixed fractional lengths.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-5ohUTYbzAkQ/TmJcO5ktT-I/AAAAAAAAA1U/7P-UKP15VgY/s1600/4_smooth_curve.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" 
src="http://3.bp.blogspot.com/-5ohUTYbzAkQ/TmJcO5ktT-I/AAAAAAAAA1U/7P-UKP15VgY/s1600/4_smooth_curve.png" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;The result is a smooth piece-wise approximation through every point.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;PN-Triangles tend to make meshes "bulge", because a curve around a convex hull always extends outward.&amp;nbsp; You can see the same look in our interpolation.&amp;nbsp; This is good for on-ramps but actually looks quite bad for a straight-away road - the road "bulges out" when the curve ends.&lt;br /&gt;&lt;br /&gt;To address this, we categorize a road as "straight" if the road is long enough that &lt;i&gt;if&lt;/i&gt; we built a curve out of it, the radius of that curve would be larger than a constant value.&amp;nbsp; (We pick different constants for different types of roads.)&amp;nbsp; In other words, if two highway segments are each 1 km long and they meet at a 3 degree angle, we do not assume they are part of an arc with a 19 km radius - we assume that &lt;i&gt;most&lt;/i&gt; of each 1 km segment is straight, with a small curve (of much smaller radius) at the end.&amp;nbsp; For any given point, we can decide whether either one or both of the two adjoining line segments is "straight" or should be entirely curved.&amp;nbsp; We then form five cases:&lt;br /&gt;&lt;ol style="text-align: left;"&gt;&lt;li&gt;If two segments come together at a very sharp angle, we simply keep the sharp angle.&amp;nbsp; We assume that if the data had this sharp angle in the original vector data (which is quite highly noded) then there really is some kind of sharp corner.&lt;/li&gt;&lt;li&gt;If the two segments come together at a very shallow angle, we simply keep the corner, because who cares.&amp;nbsp; This case matters when we have a very tiny angle (e.g. 
0.25 degrees) but very long line segments, such that removing the tiny angle would cause significant change in the vector position due to the long "arm" and not the steep angle.&amp;nbsp; We trust that for our app the tiny curve isn't going to be visible.&lt;/li&gt;&lt;li&gt;If the two segments are both curved, we use the PN-triangle-style tangents as usual.&lt;/li&gt;&lt;li&gt;If one of the segments is curved and one is straight, the tangent at the point comes from the straight segment.&amp;nbsp; This takes the "bulge" out of the beginning and ending of our curves by ensuring that the curve ends by heading "into" the straight-away.&lt;/li&gt;&lt;li&gt;If both segments are straight, we need to round the corner on the inside.&amp;nbsp; We do this by pulling back the corner along both segments and using the original point as a quadratic bezier control point.&lt;/li&gt;&lt;/ol&gt;One note about cases 2 and 5: if you pull back a vertex in proportion to the angle of the vertex, then very small angles can result in very small pull-backs, resulting in a very short curve.&amp;nbsp; In the case of X-Plane, we want to ensure that there is a minimum distance between points (so that rounding errors in the DSF encode don't give us colocated points); this means that the minimum angle for case 2 has to be large enough to prevent a "tiny curve" in case 5.&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-MueO7w-YWcM/TmJj3BmSc-I/AAAAAAAAA1o/zHLHk165lzI/s1600/curve_cases.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-MueO7w-YWcM/TmJj3BmSc-I/AAAAAAAAA1o/zHLHk165lzI/s1600/curve_cases.png" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;The five 
curve cases, illustrated.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;div&gt;With these five curve cases we get pretty good looking curved roads.&amp;nbsp; But our point count gets out of control - at a minimum we've kept every original point, and on top of that we've added one or two bezier control points per segment.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What we need to do is generalize our curves.&amp;nbsp; Again, the PN-triangles observation can help us.&amp;nbsp; If we want to replace two piece-wise bezier curves with a single one, we know this: the tangents at the two ends of the curve can't change.&amp;nbsp; This means that each control point of the approximate curve must be colinear with the original curve end point and the original control point at that end.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So what?&amp;nbsp; Well, if we can only move the control points "in and out" then there are really only two scalar variables for &lt;i&gt;all&lt;/i&gt; possible approximations: how much to scale the control handles at the start and end.&amp;nbsp; And that's something we can check with brute force!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;div&gt;Below are the basic steps for approximating a piece-wise bezier curve with two pieces as a single cubic bezier.&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-0Iw7xi1N5c8/TmJcPMPBj7I/AAAAAAAAA1Y/ikHsZJWXsDI/s1600/5_many_curves.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-0Iw7xi1N5c8/TmJcPMPBj7I/AAAAAAAAA1Y/ikHsZJWXsDI/s1600/5_many_curves.png" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;We start with a piece-wise bezier with nodes 
and control points.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://4.bp.blogspot.com/-qgnAqsJkzLk/TmJcPErV76I/AAAAAAAAA1c/n9Ow22ny7b0/s1600/6_push_tangents.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="http://4.bp.blogspot.com/-qgnAqsJkzLk/TmJcPErV76I/AAAAAAAAA1c/n9Ow22ny7b0/s1600/6_push_tangents.png" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;For each range of curves that we will simplify, we "push" the outermost control points along the tangent vector by an arbitrary scaling factor.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-Ir2PdEolTDg/TmJcPW0gEHI/AAAAAAAAA1g/S1IfySQx05k/s1600/7_near_curve.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-Ir2PdEolTDg/TmJcPW0gEHI/AAAAAAAAA1g/S1IfySQx05k/s1600/7_near_curve.png" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;The resulting curve will be close, but not quite the same as the original.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-WDVSfFE-9NM/TmJcPo4PxBI/AAAAAAAAA1k/vJ14thHsB5s/s1600/8_final_out.png" 
imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-WDVSfFE-9NM/TmJcPo4PxBI/AAAAAAAAA1k/vJ14thHsB5s/s1600/8_final_out.png" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;The result also looks "reasonable" on its own - that is, the approximations tend to have good curvature characteristics.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;To find the approximate curve, we simply "search" a whole range of scalar values by trying them and measuring curve error.&amp;nbsp; In the scenery tools code, we do a two-step search, refining the scalars around the values of least error.&amp;nbsp; The initial values are picked experimentally; it's almost certainly possible to do a better job of guessing scalar values but I haven't had time to research it more.&lt;br /&gt;&lt;br /&gt;To measure the error we approximate the beziers with polylines (i.e. we turn each individual bezier into a poly-line of N segments) and then compare the polylines.&amp;nbsp; The polyline comparison is the variance of the distances of each point in one polyline to the other. 
(in other words, we treat one polyline as a point set and take the variance of the distance-to-polyline of each point).&amp;nbsp; This is similar to the &lt;a href="http://en.wikipedia.org/wiki/Hausdorff_distance"&gt;Hausdorff distance&lt;/a&gt; with two key differences:&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;Because we are taking variance and not a minimum error, we can't use our previous minimum distance from a point to a line segment to limit our spatial searches.&amp;nbsp; (See below.)&amp;nbsp; Instead, we pick some large distance beyond which the curves are too different and we use that to limit.&amp;nbsp; For low maximum acceptable errors this gives us good performance.&lt;/li&gt;&lt;li&gt;Since the variance depends on all points and not just the worst one, we can rank multiple approximations - that is, generally better approximations score quite a bit higher.&lt;/li&gt;&lt;/ul&gt;To speed up the polyline compare (which is naively O(N^2) and dominates CPU time if implemented via a nested for-loop) we can create a spatial index for the "master" line (since we will compare many candidates to it) and search only a few of the master segments when looking for the closest one.&amp;nbsp; If we know that past a certain error we're simply broken, then we can limit our queries. For my implementation, I pick an axis (e.g. 
X or Y) based on the original curve shape and then subdivide the polyline into monotone sub-polylines that can be put into a coordinate-sorted map.&amp;nbsp; Then we can use lower_bound to find our segments in O(log N) time.&amp;nbsp; An R-tree would probably work as well.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Phew.&amp;nbsp; So we can take a piece-wise bezier and come up with the best approximation through brute force and error checking.&amp;nbsp; How do we simplify an entire road?&amp;nbsp; The answer is &lt;i&gt;not&lt;/i&gt; &lt;a href="http://en.wikipedia.org/wiki/Ramer%E2%80%93Douglas%E2%80%93Peucker_algorithm"&gt;Douglas-Peucker&lt;/a&gt;.&amp;nbsp; Instead we use a bottom-up combine:&lt;/div&gt;&lt;div&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;For every non-end node in the piece-wise curve, we build the approximation of the two adjoining bezier curves and measure its error.&lt;/li&gt;&lt;li&gt;We queue every "possible merge" by error.&lt;/li&gt;&lt;li&gt;Until the queue is empty or the lowest error is too large we:&lt;/li&gt;&lt;li&gt;Replace the two curves in the merge by one.&lt;/li&gt;&lt;li&gt;Recalculate the two neighboring merges (since one of their source curves is now quite a bit different).&amp;nbsp; Note that we must keep the original beziers around to get accurate error metrics, so a merge of two curves that originally covered eight curves is an approximation of all eight originals, not the two previous merges.&lt;/li&gt;&lt;/ul&gt;Why is this better than DP?&amp;nbsp; Well, the cost of approximating two curves is going to be O(NlogN) where N is the number of pieces in the piece-wise curve - this comes straight from the cost of error measurement.&amp;nbsp; Therefore the first run through DP, just to add one point back, is going to be O(N^2logN) because it must do a full curve comparison between the original and every pair it tries.&amp;nbsp; When the curve approximation is going to contain a reasonably large number of 
pieces (e.g. we might merge each bezier with its neighbor and be done) the bottom-up approach gets there fast while DP does a ton of work just to realize that it didn't need to do that work.&amp;nbsp; (DP doesn't have this issue with polylines because the comparison cost is constant per trial point.)&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-1235395280447638425?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/1235395280447638425/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/09/bezier-curve-optimization.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/1235395280447638425'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/1235395280447638425'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/09/bezier-curve-optimization.html' title='Bezier Curve Optimization'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-noh4LnJOKjg/TmJcOLmW_KI/AAAAAAAAA1I/qWHc2mTR3hU/s72-c/1_orig_poly_line.png' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-1951579825614933194</id><published>2011-09-01T18:07:00.002-04:00</published><updated>2011-09-02T23:10:39.126-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='STL'/><title type='text'>Sequences Vs. Iterators</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;Lately I've been playing with an STL "sequence" concept.&amp;nbsp; That's a terrible name and I'm sure there are other things that are officially "sequences" (probably in Boost or something) and that what I am calling sequences actually have some other name (again, probably in Boost).&amp;nbsp; Anyway, if you know what this stuff should be called, please leave a comment.&lt;br /&gt;&lt;br /&gt;EDIT: Arthur pointed me to &lt;a href="http://www.uop.edu.jo/download/PdfCourses/Cplus/iterators-must-go.pdf"&gt;this paper&lt;/a&gt;; indeed what I am calling sequences are apparently known as "ranges".&amp;nbsp; Having just recoded victor airway support and nav DB access using ranges, I can say that the ability to rapidly concatenate and nest filters using adapter templates is a &lt;i&gt;huge&lt;/i&gt; productivity win.&amp;nbsp; (See page 44 to have your brain blown off of its hinges.&amp;nbsp; I thought you needed Python for that kind of thing!)&lt;br /&gt;&lt;br /&gt;Basically: a sequence is a data type that can move forward, return a value, and knows for itself that it is finished - it is the abstraction of a C null-terminated string.&amp;nbsp; Sequences differ from forward iterators in that you don't use a second iterator to find the end - instead you walk the sequence until it ends.&lt;br /&gt;&lt;br /&gt;Why would you ever want such a creature?&amp;nbsp; The use case that really works nicely is adaptation.&amp;nbsp; When you want to adapt an iterator (e.g. wrap it with another iterator that skips some elements, etc.) 
you need to give your current iterator both an underlying "now" iterator and the end; similarly, the end iterator you use for comparison is effectively a place-holder, since the filtered iterator already knows where the end is.&lt;br /&gt;&lt;br /&gt;With sequences life is a lot easier: the adapting sequence is done when its underlying sequence is done.&amp;nbsp; Adapting sequences can easily add elements (e.g. adapt a sequence of points to a sequence of mid points) or remove them (e.g. return only sharp corners).&lt;br /&gt;&lt;br /&gt;My C++ sequence concept uses three operators:&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;Function operator with no arguments to tell if the sequence is still valid - false means it is finished.&lt;/li&gt;&lt;li&gt;Pre-increment operator advances to the next item.&lt;/li&gt;&lt;li&gt;Dereference operator returns the current value.&lt;/li&gt;&lt;/ul&gt;You could add more - post increment, copy constructors, comparisons, etc. but I'm not sure that they're necessary.&amp;nbsp; The coding idiom looks like this:&lt;br /&gt;&lt;br /&gt;&lt;blockquote&gt;&lt;code&gt;while(my_seq())&lt;br /&gt;{&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; do_stuff_to(*my_seq);&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; ++my_seq;&lt;br /&gt;}&lt;/code&gt;&lt;/blockquote&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-1951579825614933194?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/1951579825614933194/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/09/sequences-vs-iterators.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/6042417775578107106/posts/default/1951579825614933194'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/1951579825614933194'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/09/sequences-vs-iterators.html' title='Sequences Vs. Iterators'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-8363224155171652564</id><published>2011-08-29T21:02:00.000-04:00</published><updated>2011-08-29T21:02:45.691-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='CGAL'/><title type='text'>merge_edge - Fixed, Sort of.</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;A while ago I wrote that you &lt;a href="http://hacksoflife.blogspot.com/2008/10/cgla-abusing-mergeedge.html"&gt;can't use CGAL's merge_edge&lt;/a&gt; if the two half-edges run in opposite X directions.&amp;nbsp; It turns out this isn't entirely true; as of CGAL 3.4 (yes, it's been a while since we went to latest) you can merge if you're very careful.&lt;br /&gt;&lt;br /&gt;The issue is that CGAL caches the direction of a half-edge and doesn't invalidate the cache when you merge the edge.&amp;nbsp; Since it is replacing the curve of one of the two edges (the other is deleted) the cache could get out of sync with the curve, which causes chaos.&lt;br /&gt;&lt;br /&gt;The work-around requires that you know which of two edges is going to be saved.&amp;nbsp; If you pass two halfedges h1 and h2 such that h1's target is h2's source (and that point is the vertex to be removed) then h1 and its twin are kept (and represent the 
merged edge) and h2 (and its twin) are deleted.&lt;br /&gt;&lt;br /&gt;If h1 has the same direction as the new curve, you can simply merge h1,h2.&amp;nbsp; But if they run in opposite directions this means that h1 and h2 must not have the same direction (because two same-direction curves add up to the same direction curve).&amp;nbsp; If h1 does not match the curve then h2 does.&amp;nbsp; If h2 matches the curve then h2's twin matches the curve's opposite.&amp;nbsp; Therefore the work-around when the curve and h1 don't match is to merge h2's twin, h1's twin with the curve reversed.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-8363224155171652564?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/8363224155171652564/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/08/mergeedge-fixed-sort-of.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8363224155171652564'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8363224155171652564'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/08/mergeedge-fixed-sort-of.html' title='merge_edge - Fixed, Sort of.'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-51659324961129581</id><published>2011-08-22T14:03:00.000-04:00</published><updated>2011-08-22T14:03:36.761-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Computational Geometry'/><title type='text'>The Joys of Bezier Curves [NOT]</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;Someone please remind me why bezier curves are such a common parametric curve choice in the computer graphics world?&amp;nbsp; Some of their charming properties...&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;No analytic solution for the curve's length.&amp;nbsp; The integral will make you cry.&lt;/li&gt;&lt;li&gt;No analytic solution for the intersection of two curves.&amp;nbsp; Well, &lt;a href="http://www.truetex.com/bezint.htm"&gt;this guy&lt;/a&gt; found one, but he's not going to tell you what it is.&lt;/li&gt;&lt;li&gt;No solution to find the closest point of encounter between two disjoint curves. &lt;/li&gt;&lt;li&gt;No analytic solution to find the parametric value to split the bezier at a particular known length interval (e.g. 
into two halves of equal length).&lt;/li&gt;&lt;/ul&gt;You can subdivide a bezier curve into X or Y monotone regions analytically - to do this you intersect the X or Y parametric derivative with 0 and solve for t using the quadratic equation.&lt;br /&gt;&lt;br /&gt;You can also intersect a bezier curve with a horizontal or vertical line - to do this you fill in the line coordinate and use the cubic equation (which does have a long but scary analytical solution) to find the roots.&amp;nbsp; (See &lt;a href="http://dev.x-plane.com/cgit/cgit.cgi/xptools.git/plain/src/Utils/CompGeomDefs2.h"&gt;here&lt;/a&gt; for code.)&lt;br /&gt;&lt;br /&gt;Well, at least they're not riddled with &lt;a href="http://www.google.com/search?q=bezier+curve&amp;amp;btnG=Search+Patents&amp;amp;tbm=pts&amp;amp;tbo=1&amp;amp;hl=en"&gt;patents&lt;/a&gt;.&amp;nbsp; Oh wait...&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-51659324961129581?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/51659324961129581/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/08/joys-of-bezier-curves-not.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/51659324961129581'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/51659324961129581'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/08/joys-of-bezier-curves-not.html' title='The Joys of Bezier Curves [NOT]'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-7239322376717594968</id><published>2011-08-19T15:13:00.000-04:00</published><updated>2011-08-19T15:13:38.348-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OS X'/><category scheme='http://www.blogger.com/atom/ns#' term='Installers'/><title type='text'>installer</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;A few tricks to get out of jail with OS X installs and updates...&lt;br /&gt;&lt;br /&gt;If you have a package that just won't install due to bad voodoo (this can happen if you install a lot of seeds and the stars misalign) you can force the install with this:&lt;br /&gt;&lt;code&gt;sudo CM_BUILD=CM_BUILD COMMAND_LINE_INSTALL=1 installer -verbose -pkg MacOSXUpd10.6.5.pkg -target /&lt;/code&gt;&lt;br /&gt;If you need to install from an OS CD you can find the package to use for this trick in&lt;br /&gt;&lt;code&gt;/Volumes/volname/System/Installation/Packages/OSInstall.mpkg&lt;/code&gt;&lt;br /&gt;One use for this is to force an install onto a partition that the OS doesn't understand.&amp;nbsp; I had my main drive triple-booted to OS X 10.5.8, Windows Vista (don't get me started) and Ubuntu 8.whatever.&amp;nbsp; Remaking this delicate balance without three OS reinstalls is virtually impossible, but the OS X Snow Leopard installer didn't want to install because it didn't understand the partition map.&lt;br /&gt;&lt;br /&gt;The fix on the net is to resize the OS X partition, which causes Disk Utility to fondle the partition map in some useful way, but there's no way I want to risk my other OS installs.&amp;nbsp; Installing the OS from the command line with a forced install lets me simply dump the OS onto the drive, and then rEFIt just works because, well, it's 
rEFIt.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-7239322376717594968?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/7239322376717594968/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/08/installer.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7239322376717594968'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7239322376717594968'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/08/installer.html' title='installer'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-8990050618443495842</id><published>2011-08-18T13:03:00.000-04:00</published><updated>2011-08-18T13:03:58.875-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Wordpress'/><title type='text'>Why is Wordpress Slow?</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;I love WordPress, so it breaks my heart when our nice shiny WordPress pages take 9 seconds to load.&amp;nbsp; Here are the results of some investigations.&amp;nbsp; I don't do WP professionally, but WP is really easy to tinker with, so I'll blog this to avoid forgetting it.&lt;br /&gt;&lt;br /&gt;The X-Plane blog uses &lt;a href="http://wordpress.org/extend/plugins/wp-super-cache/"&gt;WP 
Super Cache&lt;/a&gt; and a pile of social networking plugins.&amp;nbsp; Here's what I found for speed.&lt;br /&gt;&lt;br /&gt;When we miss the cache, page load is really slow.&amp;nbsp; You can tell whether you missed the cache and why it's slow by looking at the last few lines of the page; WP Super Cache will list the time the cached page was last built, the render time in seconds, and it'll tell you if it's gzipping.&amp;nbsp; We see dynamic content times from 2 to 9 seconds!&lt;br /&gt;&lt;br /&gt;Disabling &lt;a href="http://wordpress.org/extend/plugins/tweet-this/"&gt;Tweet This&lt;/a&gt; brought the time down to less than a second.&amp;nbsp; I don't know what Tweet This is doing (probably blocking on IO with Twitter on the server side while the page renders) but our first action will be to explore whether another plugin is faster.&lt;br /&gt;&lt;br /&gt;So the number one issue is that when we miss the cache, run time costs are killing us.&lt;br /&gt;&lt;br /&gt;When we hit the cache, Safari's timeline view of resources tells us what is causing the page to be slow to load.&amp;nbsp; We can see a few things:&lt;br /&gt;&lt;ul style="text-align: left;"&gt;&lt;li&gt;One plugin is putting its JavaScript link at the end of the page, so we lose parallelism in loading.&lt;/li&gt;&lt;li&gt;Some plugins are going to external sites with much higher latency than our server.&lt;/li&gt;&lt;li&gt;We're "scattering" a bit to get our JS - consolidation might be a good idea, but in practice we don't have enough JS to care, plus the browser will cache.&lt;/li&gt;&lt;/ul&gt;We have some slow-loading stragglers from social media live content, but they don't block page load, so I guess we can live with that for now.&lt;br /&gt;&lt;br /&gt;Finally the third and weirdest finding: if you have local browser cookies, WP Super Cache dutifully caches the customized version of the page you see with the forms filled out.&amp;nbsp; This means that you get your own private cache 
site.&lt;br /&gt;&lt;br /&gt;This is a bit terrifying, first because we could have a really huge build-up of files in the cache. But it's also bad news for cache hits.&amp;nbsp; When the site changes (e.g. a comment is posted) the first user to view the page eats the cache miss.&amp;nbsp; But if you have cookies, you always miss the cache since you have your own private cache.&lt;br /&gt;&lt;br /&gt;In our case this is really bad: it means that users who have commented before (and thus have a comment-name cookie in place) will always miss the cache once for every new article they see plus every comment posted.&amp;nbsp; Which is to say, the site will virtually always be "slow" (which in our case is "really slow" due to slow plugins). &lt;br /&gt;&lt;br /&gt;I discovered this by putting WP Super Cache in debug mode, setting my IP as the debug URL and setting the debug level to 5.&amp;nbsp; Then when I first loaded a page in Firefox, I saw a whole pile of cache output due to cookies - when I viewed the cache meta data on the server, my own commenting name and email were clearly visible.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-8990050618443495842?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/8990050618443495842/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/08/why-is-wordpress-slow.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8990050618443495842'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8990050618443495842'/><link rel='alternate' type='text/html' 
href='http://hacksoflife.blogspot.com/2011/08/why-is-wordpress-slow.html' title='Why is Wordpress Slow?'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-997792244476189626</id><published>2011-08-18T11:35:00.000-04:00</published><updated>2011-08-18T11:35:59.515-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Wordpress'/><title type='text'>Fixing Wordpress Auto-Update Issues</title><content type='html'>&lt;div dir="ltr" style="text-align: left;" trbidi="on"&gt;There are a ton of posts from people having trouble auto-updating WordPress.&amp;nbsp;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.nerdgrind.com/wordpress-automatic-upgrade-plugin-failed-or-not-working/"&gt;This is the post&lt;/a&gt; that has the solution to the underlying problem.&amp;nbsp; I will try to explain why it works, since I totally misunderstood this the first time and without understanding it, it's hard to fix.&lt;br /&gt;&lt;br /&gt;When WordPress auto-updates your blog, it doesn't do so as the "apache" user that usually runs httpd.&amp;nbsp; Instead it uses your ftp login to place files into the local file system as "you".&lt;br /&gt;&lt;br /&gt;This is very clever because it means that you don't have to give apache carte blanche over your site, protecting you from web daemons run amok (or whatever it is that web developers worry about).&lt;br /&gt;&lt;br /&gt;So the first key point is that you need to allow httpd to make FTP calls out to servers.&amp;nbsp; That's where&lt;br /&gt;&lt;code&gt;/usr/sbin/setsebool -P httpd_can_network_connect=1&lt;/code&gt;&lt;br /&gt;comes in.&amp;nbsp; This gives httpd permission to make outgoing network connections so 
that it can call up your server via FTP as you.&amp;nbsp;&lt;br /&gt;&lt;br /&gt;Without this you get the "Failed to connect to FTP server XXX" error (because httpd isn't allowed to make the outgoing connection - what's tricky here is that it is the client side of FTP that's failing, not your FTP server).&lt;br /&gt;&lt;br /&gt;The second key point is that if you have multiple users administrating WP, things aren't going to go well.&amp;nbsp; Plugin files will be owned by you with permissions of 644, so you are the only person with write permission.&amp;nbsp; If another site admin tries to update, you get an error saying that WP couldn't remove the old version of the plugin.&lt;br /&gt;&lt;br /&gt;I don't have a great solution for this yet.&amp;nbsp; I'll update this post when we fix the problem.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-997792244476189626?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/997792244476189626/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/08/fixing-wordpress-auto-update-issues.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/997792244476189626'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/997792244476189626'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/08/fixing-wordpress-auto-update-issues.html' title='Fixing Wordpress Auto-Update Issues'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-5857955678389943294</id><published>2011-06-01T14:59:00.004-04:00</published><updated>2011-06-01T15:36:09.695-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><category scheme='http://www.blogger.com/atom/ns#' term='Performance'/><title type='text'>Guessing the Fine Print</title><content type='html'>One of the great things about OpenGL is that it can draw things &lt;span style="font-style: italic;"&gt;really fast&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;One of the great things about OpenGL is that it's &lt;span style="font-style: italic;"&gt;really flexible&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;But is it fast &lt;span style="font-style: italic;"&gt;and&lt;/span&gt; flexible?  No.  There are "fast paths"; ask the GL to do something adequately byzantine and it's going to get the job done by a correct but not particularly optimized driver path.&lt;br /&gt;&lt;br /&gt;I have had one or two occasions to peer over the shoulder of driver writers and see what a production driver looks like.  Here's a taste.  (Note: this is made up for illustration...no NDAs were harmed in the creation of this example.)&lt;br /&gt;&lt;br /&gt;&lt;code&gt;/* Figure out if we can push vertices through the fast path. 
*/&lt;br /&gt;&lt;br /&gt;#if X2913_GPU&lt;br /&gt;#if USE_FAST_PATHS&lt;br /&gt;if (&lt;br /&gt;  vbo-&amp;gt;internal.struct_align % STRUCT_ALIGN_MOD == 0 &amp;amp;&amp;amp;&lt;br /&gt;  (vbo-&amp;gt;source_mode == SOURCE_AGP || !vbo-&amp;gt;resident) &amp;amp;&amp;amp;&lt;br /&gt;#if FIX_STALL_BUG&lt;br /&gt;  vbo-&amp;gt;current_day != AGP_YES_IT_IS_TUESDAY &amp;amp;&amp;amp;&lt;br /&gt;#endif&lt;br /&gt;  vbo-&amp;gt;internal.size &amp;gt; MIN_SIZE_FOR_FAST_PATH &amp;amp;&amp;amp;&lt;br /&gt;  vbo-&amp;gt;internal.size &amp;lt; MAX_SIZE_FOR_FAST_PATH &amp;amp;&amp;amp;&lt;br /&gt;  IS_SIZE_WE_LIKE(vbo-&amp;gt;internal.size) &amp;amp;&amp;amp;&lt;br /&gt;#endif &lt;br /&gt;  FREE_SPACE(CMD_PACKET_BUF(OUR_CONTEXT)) &amp;gt; CMD_PACKET_ACCEL_SIZE)&lt;br /&gt;  { &lt;br /&gt;      /* do accelerated case */&lt;br /&gt;  } else&lt;br /&gt;      /* another 50,000 conditions have to be met. */&lt;br /&gt;#endif&lt;br /&gt;#if NEXT_GPU_THAT_IS_TOTALLY_DIFFERENT&lt;br /&gt;  ...&lt;br /&gt; &lt;/code&gt;&lt;br /&gt;What we have here might qualify as a &lt;a href="http://www.joelonsoftware.com/articles/LeakyAbstractions.html"&gt;leaky abstraction&lt;/a&gt; (at least with respect to performance): the fast path isn't obvious from the OpenGL API, but  it matters.&lt;br /&gt;&lt;br /&gt;Well, every now and then, you get to see yourself fall off the fast path.  Ouch!&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/-4UMIyCROn2c/TeaOGMAyOBI/AAAAAAAAAt8/rY44LEVjEtQ/s1600/copy-sub-tex-fail.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 92px;" src="http://3.bp.blogspot.com/-4UMIyCROn2c/TeaOGMAyOBI/AAAAAAAAAt8/rY44LEVjEtQ/s200/copy-sub-tex-fail.png" alt="" id="BLOGGER_PHOTO_ID_5613330222518777874" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;This is a screenshot of an Instruments 2.x trace (with the time profiler - 1.x won't give you this kind of info) of X-Plane with a fast path failure.  
In this case, we do a glCopyTexSubImage2D and...bad things happen! It's taking 67% of our frame time!  In this case, we can sort of guess what the driver might be doing.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;57% of the time goes into a gldFinish - I speculate that that's Apple asking nVidia to finish filling pixels on a surface.  This of course  goes through into kernel space and spends a lot of time doing things that have "wait" in them.&lt;/li&gt;&lt;li&gt;Another 8.2% is in glgProcessPixelsWithProcessor - that sounds a lot like Apple using the host to do some kind of pixel op.&lt;/li&gt;&lt;/ul&gt;Put it together and we realize: whatever we asked for, the driver can't do it in hardware, so instead the OpenGL stack is stalling until the GPU completes, reading the data back to the host, and processing it. &lt;br /&gt;&lt;br /&gt;Driver monitor confirms this - with non-zero "time spent waiting in user code" (meaning a call that might not block is blocking) and a non-zero texture page off bytes (meaning something in VRAM had to be copied back to the host).  This is &lt;span style="font-style: italic;"&gt;not&lt;/span&gt; what we want out of glCopyTexSubImage2D.  Generally we never want to copy anything off of the GPU and we never want to wait in host.&lt;br /&gt;&lt;br /&gt;What did it turn out to be?  Well, the first surprise was that we were using glCopyTexSubImage2D at all (and not using an FBO).  It turns out that we were reading back from an RGBA16F surface into an RGBA8 texture in a misguided attempt to cope with mismatching gamma.  Of course, the driver could in theory build a custom shader to make that transformation, but it's very reasonable to expect a punt.  
Getting the two surfaces to the same format and using an FBO fixed the problem.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-5857955678389943294?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/5857955678389943294/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/06/guessing-fine-print.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5857955678389943294'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5857955678389943294'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/06/guessing-fine-print.html' title='Guessing the Fine Print'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-4UMIyCROn2c/TeaOGMAyOBI/AAAAAAAAAt8/rY44LEVjEtQ/s72-c/copy-sub-tex-fail.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-8734028122545313733</id><published>2011-06-01T12:15:00.004-04:00</published><updated>2011-06-01T12:50:20.429-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>EXT vs ARB - The Fine Print</title><content type='html'>I thought I had found a driver bug: my ATI card on Linux rejecting a G-Buffer with a mix of RGBA8 and RG16F surfaces. 
I know the rules: DX10-class cards need the same bit plane width for all MRT surfaces.&lt;br /&gt;&lt;br /&gt;I had good reason to think driver bug: the nappy old drivers I got from Ubuntu 10.04 showed absolutely no shadows at all, weird flicker, incorrect shadow map generation - and cleaning them out and grabbing the 11-5 Catalyst drivers fixed it.&lt;br /&gt;&lt;br /&gt;Well, when the drivers don't work, there's always another explanation: an idiot app developer who doesn't know what his own app does.  In that guy's defense, maybe the app is large and complex and has distinct modules that sometimes interact in surprising ways.&lt;br /&gt;&lt;br /&gt;In my case, the ATI drivers follow the rules: with the "EXT" variant of the framebuffer extension, an incomplete-formats error is returned if the color attachments aren't all of the same internal format.  This was relaxed in the "ARB" variant, which gives you more MRT flexibility.&lt;br /&gt;&lt;br /&gt;What amazes me is that the driver cares!  The driver actually tracks which entry point I use and changes the rules for the FBO based on how I got in.  Lord knows what would happen if I mixed and matched entry points.  
I feel bad for the poor driver writers for having to add the complexity to their code to manage this correctness.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-8734028122545313733?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/8734028122545313733/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/06/ext-vs-arb-fine-print.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8734028122545313733'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8734028122545313733'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/06/ext-vs-arb-fine-print.html' title='EXT vs ARB - The Fine Print'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-9176791006422537425</id><published>2011-05-26T14:58:00.002-04:00</published><updated>2011-05-26T15:10:17.101-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='CGAL'/><category scheme='http://www.blogger.com/atom/ns#' term='Computational Geometry'/><title type='text'>Mesh Simplification Part III - Simplifying A Triangulation</title><content type='html'>Previously I suggested using a Delaunay triangulation as a spatial index to find "squatters" when simplifying an arrangement.  
If we want to simplify a triangulation itself, then the triangulation is the spatial index.&lt;br /&gt;&lt;br /&gt;Consider a constrained Delaunay triangulation, where our arrangement edges have been replaced with triangulation constraints, and there are no free vertices (nor are there vertices in the triangulation that don't have at least one incident constraint).&lt;br /&gt;&lt;br /&gt;We can now use an idea from the &lt;a href="http://www.springerlink.com/content/f3j0667118n71567/"&gt;previously referenced&lt;/a&gt; paper: given a pair of constraints forming a curve pqr that we want to simplify into pr, if there exist any vertices that might be on the edge or in the interior of triangle pqr, then they must be adjacent to q in the triangulation (and between pq and pr on the acute side of pqr).&lt;br /&gt;&lt;br /&gt;This means that we can simply circulate around vertex q to search for squatters.  The triangulation &lt;span style="font-style: italic;"&gt;is&lt;/span&gt; the index.&lt;br /&gt;&lt;br /&gt;Why does this work?  Well, consider the case where there exists a vertex X inside triangle PQR that is not adjacent to Q.  You can't have free vertices in a triangulation; the act of triangulating out X is going to create at least one link between Q and X; the only way that this will not happen is if there is already some &lt;span style="font-style: italic;"&gt;other&lt;/span&gt; point inside PQR that is closer to Q than X (and in that case, we fail for that other point).&lt;br /&gt;&lt;br /&gt;The triangulation also has the nice property that when we remove a vertex X, we can reconsider its adjacent vertices to see if X was a squatter of those other vertices.  
This works because if X is a squatter of Q and there are no other squatters (thus removing X "unlocks" Q) then X and Q must be connected.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Implementation Notes&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I have one other implementation consideration besides the usual requirements: I have a many-to-many link between vertices in my original arrangement and vertices in my triangulation.  Some triangulation vertices will not have original nodes because they represent subdivision of the triangulation to improve elevation data.  And some original arrangement vertices will not be in the triangulation due to simplification of the triangulation's constraints.&lt;br /&gt;&lt;br /&gt;The problem is: how do we work backward from a triangulation triangle to an original arrangement face?  Given a triangle with a constraint on one side, we need to figure out what arrangement halfedge(s) it links to.&lt;br /&gt;&lt;br /&gt;In order to keep this "back-link" unambiguous, we cannot remove all of the degree 2 vertices from a poly-line of original edge segments.  We need to leave at least one "poly-line interior" vertex in place to disambiguate two paths between vertices in the arrangement.  
(This case happens a lot when we have closed loops.)&lt;br /&gt;&lt;br /&gt;We could never remove the poly-line interior vertices from all paths anyway (because they would collapse to zero paths), but in practice we don't want to remove them from any poly-line because it makes resolving the original arrangement face more difficult.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-9176791006422537425?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/9176791006422537425/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/mesh-simplification-part-iii.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/9176791006422537425'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/9176791006422537425'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/mesh-simplification-part-iii.html' title='Mesh Simplification Part III - Simplifying A Triangulation'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-2291839265117554424</id><published>2011-05-26T14:33:00.003-04:00</published><updated>2011-05-26T14:59:23.614-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='CGAL'/><category scheme='http://www.blogger.com/atom/ns#' term='Computational Geometry'/><title type='text'>Mesh 
Simplification Part II - Arrangement Simplification</title><content type='html'>In my previous post, I suggested that we can iteratively simplify an arrangement if we can test a node's degree, check for the pre-existence of the simplifying edge we want to replace it with, and confirm that there are no "squatting" vertices inside the triangle formed by the two old edges and the one new one.&lt;br /&gt;&lt;br /&gt;To simplify an arrangement, therefore, what we really need is a good spatial index to make searching for squatters fast.&lt;br /&gt;&lt;br /&gt;Previously I had used a quadtree-like structure, but I seem to be getting better results using a Delaunay triangulation.  (This idea is based on the CGAL Point_set_2 class.)&lt;br /&gt;&lt;ul&gt;&lt;li&gt;We insert every vertex of our arrangement into a Delaunay triangulation.&lt;/li&gt;&lt;li&gt;When we want to check for squatters, we find the minimum circle enclosing the triangle pqr (where pqr is the curve pair we want to simplify to pr) and search the triangulation for nodes inside the circle.&lt;/li&gt;&lt;/ul&gt;To search the Delaunay triangulation for nodes within a fixed distance of point P, we first insert P (if it isn't already present) and then do a search (depth-first or breadth-first) from P outward based on vertex adjacency.  When done, we remove P if it wasn't already part of the triangulation.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Implementation Details&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For my implementation using arrangements, there are a few quirks:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;I use my own point set; CGAL's point set uses a stack-based depth-first search that tends to overflow the stack for large data sets.&lt;/li&gt;&lt;li&gt;I do not re-queue previously "blocked" points as squatters are removed.  This would be a nice feature to add at some point (but is not easily added with a mere index).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;I abuse CGAL's "merge_edge" routine to do the simplification.  
Edge merge was meant for collinear curves; in my case I ensure beforehand that it is a safe operation.  The advantage of using merge_edge vs. actually inserting the new edges and removing the old ones is speed and stability: no faces are created or removed, thus face data stays stable, and no geometry tests are needed to determine what holes go in what face, etc.&lt;/li&gt;&lt;li&gt;Because I am edge-merging, I can't merge two edges that have opposite x-monotone "direction" - thus some details won't get simplified.  This is a limitation of CGAL's arrangement interface.&lt;/li&gt;&lt;/ul&gt;Here's why the last point happens: CGAL caches the "direction" (left to right or right to left) of its x-monotone curves on the half-edge itself.  Since merge assumes that we aren't moving the point-set that is the curve, but rather gluing two curves together in place, it assumes that the merged half-edge direction cannot have changed.  Thus it does not recalculate the direction flag.&lt;br /&gt;&lt;br /&gt;Since the method recycles two of the four half-edges in the merge, if the first half of the curve points in the opposite direction of the merged curve, the merge is changing the half-edge's direction.&lt;br /&gt;&lt;br /&gt;Could this case happen if the merged edge had the same path as the original two edges?  No.  
In order for the direction to change, the two underlying curves cannot be summed to a single curve that is still &lt;span style="font-style: italic;"&gt;x-monotone&lt;/span&gt;, which is a requirement for CGAL's arrangement.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-2291839265117554424?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/2291839265117554424/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/mesh-simplification-part-i-arrangement.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/2291839265117554424'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/2291839265117554424'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/mesh-simplification-part-i-arrangement.html' title='Mesh Simplification Part II - Arrangement Simplification'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-727262099152249558</id><published>2011-05-26T13:38:00.002-04:00</published><updated>2011-05-26T14:33:49.802-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='CGAL'/><category scheme='http://www.blogger.com/atom/ns#' term='Computational Geometry'/><title type='text'>Mesh Simplification Part I - It's All About Squatters</title><content type='html'>I've been working on map 
simplification for a few days now - it seems like I periodically have to revisit this problem.  After working on simplification yet again, I realized that the problem statement is even simpler than I had thought.&lt;br /&gt;&lt;br /&gt;Given an arrangement (that is, a set of line segments and possibly free points such that line segments don't cross or end in each other's interiors) we can iteratively simplify the map by replacing adjacent pairs of line segments with a "short-cut" (e.g. replace line segments pq and qr with pr) given the following conditions:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The degree of vertex q is 2 (i.e. only pq and qr emerge from q).&lt;/li&gt;&lt;li&gt;Line segment pr is not already in the arrangement.&lt;/li&gt;&lt;li&gt;If p, q, and r are not collinear, there are no points in the interior of triangle pqr (nor directly between p and r).  By definition there can't be any points on pq or qr.&lt;/li&gt;&lt;/ol&gt;Test 3 - the test for "squatters" (that is, points in the interior or on the border of triangle PQR) is the key.  
If any points exist inside PQR then:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;There is some kind of island geometry (or free vertex) and it will be on the wrong side of pqr after simplification, or&lt;/li&gt;&lt;li&gt;The geometry "connects" to the world outside pr and pr will intersect at least one segment.&lt;/li&gt;&lt;/ul&gt;Both cases require us to not simplify.&lt;br /&gt;&lt;br /&gt;Given this, we can build &lt;a href="http://www.springerlink.com/content/f3j0667118n71567/"&gt;an iterative algorithm for simplifying a mesh&lt;/a&gt;:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Put every vertex that passes these tests into a priority queue, ordered by the error introduced by removing it.&lt;/li&gt;&lt;li&gt;Remove the first vertex.&lt;/li&gt;&lt;li&gt;Requeue neighboring vertices based on changed error metrics.&lt;/li&gt;&lt;/ul&gt;Note that a vertex that may have been "stuck" before may now be removable, if one of the "squatters" from test 3 was previously removed.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The Zone Is Not What We Want&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Previously I had coded similar logic via a zone visiting calculation - that is, finding every face, line and point that the edge pr would intersect.  This had a few problems:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Arrangement zone calculations are really expensive.  Given a simple polygon with N sides, we may have to do as many as N zone calculations (if any vertex is eligible for removal) and the zone calculation iterates the polygon boundary.  Thus we have an O(N^2) calculation, which is really painful for large polygons made of a large number of small sides.  (Sadly, that is precisely what my data tends to be.)&lt;/li&gt;&lt;li&gt;The zone calculation is wrong; even if we don't crash &lt;span style="font-style: italic;"&gt;into&lt;/span&gt; anything while computing the zone, if the zone has holes that would be on the inside of triangle PQR then we still should not be simplifying.  
So we would have to iterate over holes as well as calculate the zone.&lt;/li&gt;&lt;/ol&gt;Up next: fast arrangement simplification.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-727262099152249558?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/727262099152249558/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/mesh-simplification-part-i-its-all.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/727262099152249558'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/727262099152249558'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/mesh-simplification-part-i-its-all.html' title='Mesh Simplification Part I - It&apos;s All About Squatters'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-573735035385572959</id><published>2011-05-24T11:08:00.002-04:00</published><updated>2011-05-24T11:14:09.656-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><category scheme='http://www.blogger.com/atom/ns#' term='Performance'/><title type='text'>Instancing Limits</title><content type='html'>I posted some &lt;a href="http://hacksoflife.blogspot.com/2011/03/instancing-numbers.html"&gt;instancing numbers&lt;/a&gt; a while ago.  
As we keep bashing on things, we've found that the upper limit for instanced meshes on a modern Mac with an ATI card appears to be about 100k instanced batches.&lt;br /&gt;&lt;br /&gt;There is definitely a trade-off between the number of "actual" batches (i.e. the number of actual draw calls into the driver) and the number of instanced meshes.  So we're looking at how to best trade off larger clumps of meshes (fewer driver calls) against smaller ones (less extra drawing when most of the clump is off screen and could have been culled).&lt;br /&gt;&lt;br /&gt;There is also a point at which it's not worth using instancing: if the number of objects in the instanced batch is really low, it's quicker to use immediate mode instancing and call draw multiple times.  We're not precisely sure where that line is, but it's low - maybe 2 or 3 batches.&lt;br /&gt;&lt;br /&gt;(Note that using client arrays to draw an instanced batch where the instancing data is in system memory appears to be a non-accelerated case on 10.6.x - if the instance data isn't in a VBO we see performance fall over and die.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-573735035385572959?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/573735035385572959/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/instancing-limits.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/573735035385572959'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/573735035385572959'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/instancing-limits.html' 
title='Instancing Limits'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-7134274598077875957</id><published>2011-05-24T11:06:00.002-04:00</published><updated>2011-05-24T11:07:25.357-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='c++'/><category scheme='http://www.blogger.com/atom/ns#' term='CGAL'/><category scheme='http://www.blogger.com/atom/ns#' term='Rants'/><title type='text'>I Hate C++ Part 857: But I Already Had Coffee!</title><content type='html'>Sigh...&lt;br /&gt;&lt;code&gt;(ioMesh.is_edge(pts[pts.size()-2],pts[pts.size()-1]),h,vnum)&lt;/code&gt;&lt;br /&gt;Clearly it's time to switch to espresso.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-7134274598077875957?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/7134274598077875957/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/i-hate-c-part-857-but-i-already-had.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7134274598077875957'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7134274598077875957'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/i-hate-c-part-857-but-i-already-had.html' title='I Hate C++ Part 857: But I Already Had Coffee!'/><author><name>Benjamin 
Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-6021258296540272648</id><published>2011-05-18T09:01:00.002-04:00</published><updated>2011-05-18T09:01:00.138-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='X-Plane'/><category scheme='http://www.blogger.com/atom/ns#' term='SSE'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><category scheme='http://www.blogger.com/atom/ns#' term='Performance'/><title type='text'>Performance Tuning Cars</title><content type='html'>I took a few hours to performance tune X-Plane 10's cars.  I must admit, this really isn't what I am supposed to be doing, but I can't resist performance tuning, and the cars touch a number of different scenery subsystems at once.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Initial Tests&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I ran a few initial tests to understand the performance problems:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Cars on max, vis distance on max, other parts of the sim "dumbed down" to isolate car costs.&lt;/li&gt;&lt;li&gt;True framerate measured, including &amp;lt; 19 fps.&lt;/li&gt;&lt;li&gt;Looked at 10.5.8 and 10.6.7 to make sure analysis on 10.5.8 wasn't biased by driver performance (which is way better on 10.6.7).&lt;/li&gt;&lt;li&gt;Looked at paused forward view, which isolates car drawing (no car AI when paused), and no-pause down view, which isolates car AI (no drawing when the world is culled).&lt;/li&gt;&lt;li&gt;Sharked both configs, time profile, all thread states, focused on the main thread.&lt;/li&gt;&lt;/ul&gt;Is this test too synthetic? 
Any time you performance test, you have to trade off how realistic the test is against how much you amplify the cost of the target system to make profiling easier.  In this case, while simply maxing out cars is a bit synthetic (who wants tons of cars and no objects?) we can say that a lot of cars looks good as a rendering effect, so having it be fast is a win.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Initial Findings&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The sim was somewhat limited by car AI (about 20 ms per frame), and heavily bound by car headlights (we were pushing 500,000+ headlights at about 7-8 fps).  The actual 3-d physical cars were a non-issue: very few due to limited visibility distance, and they all hit the instancing path (which has a huge budget on DX11 hardware).&lt;br /&gt;&lt;br /&gt;The major performance hits were:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;AI: time to edit the quad tree when cars are moved.  Since there is already a cache on this, that means that what's left of this op must be really slow.&lt;/li&gt;&lt;li&gt;The quad tree editing requires access to some per-object properties that aren't inlined.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Drawing: the transform time for car headlights.&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;First Line of Attack&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The sim transforms and builds deferred "spill" lights even in the forward renderer.  This is hugely wasteful, as these lights are just thrown out.  And getting rid of it nearly doubles draw performance.  (There's another bit of dumb work - the car headlights are fully running during the day.  I'll wait on that one; the problem is that X-Plane's lighting abstraction leaves "on/off during the day" totally to the GPU in v10, so we don't eliminate a ton of work.  
I'll leave it to later to decide how much to "expose" the implementation to cull out lights that are off during the day.)&lt;br /&gt;&lt;br /&gt;Another "wasted work" note: the spill lights are still transformed in HDR mode, but when we actually need them, things are really bad - about 5 fps.  CPU use is only 70%, which is to say, 500,000 deferred lights is a lot of lights, even for a deferred renderer.  (It may also be that the 2 million vertices pushed down to make this happen are starting to use up bus bandwidth.)  So we can consider a few more optimizations later:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Provide a cutoff distance for spawning deferred lights from dynamic scenery elements.  Since we've measured the distance on these elements anyway (to decide if we want the 3-d car) we could choose to strip out deferred lights.&lt;/li&gt;&lt;li&gt;We may want to migrate the deferred lights into geometry shaders and/or instancing to cut down on-CPU transform time and bus bandwidth.&lt;/li&gt;&lt;li&gt;We may want to "prep" streamed light VBOs on a worker thread.&lt;/li&gt;&lt;/ul&gt;These last two are a bit iffy: geometry shaders aren't available on otherwise very nice ATI hardware for some Mac OS versions, and their throughput with heavy amplification factors on NVidia DX10 hardware is not good.  Geometry shaders would probably not be a win for static deferred lights, where we aren't fighting bus bandwidth.&lt;br /&gt;&lt;br /&gt;Similarly, threading the build-out of VBOs is going to be dependent on driver capability.  If the driver can't accept mapping a buffer on a worker and unmapping on the main thread, then we're going to have problems keeping the worker thread independent without pre-allocating a lot of VBO memory.&lt;br /&gt;&lt;br /&gt;Cutting down the LOD distance of deferred lights from 20 km to 5 km takes us from 2.1 million deferred light vertices at 5 fps to 128k vertices at 10 fps.  
We can even go down to 2 km for 12 fps.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Inlining&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Another thing that  the Shark profiles showed was that the cars effectively generated some hot loops that had to call non-inlined accessor functions.  I had a note to examine inlining at some point, but in this case a few specific accessors "popped".  Inlining is a trade-off between keeping code clean and encapsulated and letting the compiler boil everything down.&lt;br /&gt;&lt;br /&gt;In this case, I chose to use macros to control whether an inline code file is included from header or translation unit.  X-Code still sees the inlines as a dependency, but we get faster compile and better GDB behavior in debug mode, plus a clean public header.&lt;br /&gt;&lt;br /&gt;Inlining didn't change the performance of drawing car headlights (whose code path was mostly inline already) but it made a significant difference in the car AI (that is, the code that decides where the cars will drive to) - that code had to update the quadtree using non-inline accessors; with 44k cars navigating we went from 41 fps.  (Also notable: the sim will run at 73 fps in the exact "AI" stress case but with cars off - that's 13 ms for basic drawing and the flight model and 11 ms for cars.  There's still a lot to be had here.)&lt;br /&gt;&lt;br /&gt;When the inlining dust settles, we have on-CPU headlight transform taking a lot of time at the low level, and reculling the scene graph when moving cars still showing up prominently.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;More Scrubbing&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;At this point we have to look at the low level headlight transformer in  x86 assembly.  Wow, that's not pretty - it looks like someone took a  scrabble set, emptied it on the floor, and then hit it repeatedly with a  hammer.  
We can pull some conditional logic out of the tight loop for a small win (about 5%) but even better: the sim is trying to "modulate" the phase of headlights per headlight.  This is a clever technique for authors, but totally stupid for headlights because they don't flash.  Pull that out and we are getting 695k headlights at 14.5 fps.  There's no substitute for simply not doing things.&lt;br /&gt;&lt;br /&gt;The assembly is spending a surprising amount of its time in the matrix transform.  It's a rare case where SSE can be a win - in this case, we can squeeze another 15% out of it.  Note that before I spent any time on SSE, I did an L2 profile - this attributes the code's time to whoever is missing the L2 cache.  The hot loop for transform didn't show up at all, indicating that the time was not being spent waiting for memory.  This surprised me a little bit, but if the memory subsystem is keeping up, trying to cram more instructions through the CPU can be a win.&lt;br /&gt;&lt;br /&gt;Now you might think: why not just push the work onto another core, or better yet the GPU?  The answer is that the particular problem doesn't play well with the more performant APIs.  We do have half a million transforms to do, but:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Since the transform is going straight into AGP memory, filling the buffers with multiple threads would require OpenGL sync code - we may get there some day, but that kind of code requires a lot of in-field validation to prove that the entire set of drivers we run on (on three operating systems) handles the case both correctly and without performance penalties.&lt;/li&gt;&lt;li&gt;The vertex data is generated off an irregular structure, making it difficult to put on the GPU.  
(This is definitely possible now with the most programmable cards, but it wouldn't run on our entire set of required hardware.)&lt;/li&gt;&lt;/ul&gt;That's tech we may get to some day, but not for now.&lt;br /&gt;&lt;br /&gt;One last note on the scrubbing: it really demonstrated the limits of GCC's optimizer.  In particular, if a call tree is passing param values that always branch one way, but the call site at which the constant is passed is not inlined with the actual if statement, I can get a win by "specializing" the functions myself.  From what I can tell, GCC can't "see" that far through the function tree.  This is with GCC 4.0 and no profile-guided optimization, so it may be there are better tools.  I also haven't tried LLVM as a back-end yet; LLVM supposedly is quite clever with inference.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Multi-Core = Free?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There's one last trick we can pull: we can move the per-frame car AIs off onto another thread that runs parallel to the flight model.  On a multi-core machine this is "free" processing if the worker threads are not saturated.  When looking "down" (so the AI is the only cost) this takes us from 65 to 80 fps.  In a forward view with 20 km visibility, we are (after all of our work on CPU) possibly limited by bus bandwidth - CPU use is "only" 85%.  This is important because in this configuration, removing the AI work won't show up as a fps change.  If we reduce the car distance from 20 km to 10 km for drawing (we still pay for AI all the time) we still get a few fps back.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Follow-Up: Final SSE Numbers&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;I wrote the rest of this post about a week and a half ago.  
Today I cleaned up some of my SSE code, the result being that I can turn SSE on and off for the same matrix primitives.&lt;br /&gt;&lt;br /&gt;Putting the sim into a "car-heavy" state causes 23% of frame-time to be spent on car headlights; using SSE improves overall frametime by 6% (that is, we get a 6% net boost in fps).  This implies that the actual car headlights (currently the only SSE-capable code) become 26% faster (6% of the frame saved out of the 23% spent there).  Since the headlights are only partly math ops, the actual matrix transforms are &lt;i&gt;significantly&lt;/i&gt; faster.  (Note the baseline still uses non-SIMD single-float SSE as the math ABI.)&lt;br /&gt;&lt;br /&gt;So what can we conclude?&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Using SSE for our matrix ops is a big win for actual time spent doing matrices.&lt;/li&gt;&lt;li&gt;X-Plane doesn't have a lot of CPU time in this category. &lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;(In fact, the scenario that gets a 6% SSE win is highly synthetic, with an insane number of cars and not much else; real numbers might not even be noticeable.  A synthetic case for optimizing is thus always dangerous - our real returns aren't as good as what we think we're getting. But in this case it proves that SSE has the &lt;i&gt;potential&lt;/i&gt; to be useful if we can find other sites to deploy the same tricks to.  
See also &lt;a href="http://hacksoflife.blogspot.com/2010/11/is-1-lot.html"&gt;this post&lt;/a&gt;.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-6021258296540272648?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/6021258296540272648/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/performance-tuning-cars.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/6021258296540272648'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/6021258296540272648'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/performance-tuning-cars.html' title='Performance Tuning Cars'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-3960546340762952461</id><published>2011-05-17T23:02:00.002-04:00</published><updated>2011-05-18T15:34:40.305-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='SSE'/><category scheme='http://www.blogger.com/atom/ns#' term='Performance'/><title type='text'>SSE?  It's the Memory, Stupid</title><content type='html'>One last SSE note: I went to apply SSE optimizations to mesh indexed matrix transforms.  
While applying some very simple SSE transforms improved throughput 15%, that gain went away when I went for a more complex SSE implementation that tried to avoid the cost of unaligned loads.&lt;br /&gt;&lt;br /&gt;Surprising?  Well, when I Sharked the more complete implementation it was clear that it was bound up on memory bandwidth.  Using the CPU more efficiently doesn't help much if the CPU is starved for data.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-3960546340762952461?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/3960546340762952461/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/sse-its-memory-stupid.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3960546340762952461'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3960546340762952461'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/sse-its-memory-stupid.html' title='SSE?  
It&apos;s the Memory, Stupid'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-8954998856368035222</id><published>2011-05-17T12:53:00.002-04:00</published><updated>2011-05-17T14:07:57.237-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='SSE'/><title type='text'>Seriosly Strange Execution?</title><content type='html'>This is a post in which I try to document what I have learned in SSE 101; if you want to make fun of me for having worked on a flight simulator for five years without writing SSE code*, go ahead now; I'll wait.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Okay then.  The last time I looked at SIMD code was with Altivec; having come from PPC code I'm only barely getting used to this whole "little endian" thing, let alone the mishmash that is x86 assembler.&lt;br /&gt;&lt;br /&gt;So a __m128 looks a lot like a float[4], and it's little endian, so if I do something like this:&lt;br /&gt;&lt;code&gt;float le[4] = { 0, 1, 2, 3 };&lt;br /&gt;__m128 aa = _mm_loadu_ps(le);&lt;/code&gt;&lt;br /&gt;then GDB tells me that aa contains 0, 1, 2, 3 in those "slots".  And a memory inspector shows 0 in the lowest four bytes.  So far so good.&lt;br /&gt;&lt;br /&gt;Then I do this:&lt;br /&gt;&lt;code&gt;__m128 cc = _mm_shuffle_ps(aa,aa,_MM_SHUFFLE(3,3,3,1));&lt;/code&gt;&lt;br /&gt;and I get 1,3,3,3 in rising memory in cc.&lt;br /&gt;&lt;br /&gt;Wha? 
&lt;br /&gt;&lt;br /&gt;Well, we can actually tease that one apart.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The _MM_SHUFFLE macro takes its parameters from high to low bits, that is, in binary 3,3,3,1 becomes 11111101 or 0xFD.&lt;/li&gt;&lt;li&gt;Thus the low two bits of the mask contain the selector (01) for the low order component of my vector.&lt;/li&gt;&lt;li&gt;Thus "1" is selected into the lowest component [0] of my array.&lt;/li&gt;&lt;/ul&gt;The selectors are effectively selecting in the &lt;i&gt;memory&lt;/i&gt; order I see, so a selector &lt;i&gt;value&lt;/i&gt; of 1 selects the [1] component.  (In my le array, I stuffed each slot of the __m128 with its own array index as part of a test to wrap my head around this.)&lt;br /&gt;&lt;br /&gt;So that's actually completely logical, as long as you understand that _MM_SHUFFLE's four arguments come in as bit-value positions, which are always written "backward" on a little endian machine.  Naively, I would have reversed the macro order (and there's nothing stopping a programmer from creating a "backward" shuffle macro that reads in "array component" order).  While this wouldn't be an issue on a big endian machine, the order of everything would mismatch memory - it's sort of nice that component 0 sits in the low order bits.  Really what we need to do is read from right to left!&lt;br /&gt;&lt;br /&gt;So I thought I had my head around things, until I looked at the contents of %xmm0.  
The shuffle code gets implemented (as shown in GDB, optimizer off) like this:&lt;br /&gt;&lt;code&gt;movaps -0x48(%ebp),%xmm0&lt;br /&gt;shufps $0xfd,-0x48(%ebp),%xmm0&lt;br /&gt;movaps %xmm0,-0x28(%ebp)&lt;/code&gt;&lt;br /&gt;If you speak x86, that's like "see spot run", but for those who don't:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;%ebp is the stack frame pointer on &lt;a href="http://developer.apple.com/library/mac/#documentation/DeveloperTools/Conceptual/LowLevelABI/130-IA-32_Function_Calling_Conventions/IA32.html"&gt;OS X&lt;/a&gt;; with the optimizer off my local __m128 variables have been given aligned storage below the frame pointer as part of the function they sit in.  -0x48 is the offset for aa and -0x28 is the offset for cc.&lt;/li&gt;&lt;li&gt;This is GCC disassembly, so the destination is on the right.&lt;/li&gt;&lt;li&gt;SSE operations typically work as src op dst -&amp;gt; dst.&lt;/li&gt;&lt;li&gt;So this code loads aa into %xmm0, shuffles it with itself from memory (the results stay in %xmm0), then writes %xmm0 back to cc.&lt;/li&gt;&lt;/ul&gt;We can step through in assembly and look at %xmm0 before and after the shuffle.  And what I see is...well, it sort of makes sense.&lt;br /&gt;&lt;br /&gt;When viewed as a 128 bit integer in the debugger, %xmm0 contains:&lt;br /&gt;&lt;blockquote&gt;128i: 0000803f 00004040 00004040 00004040&lt;br /&gt;4x32i: 40400000 40400000 40400000 3f800000&lt;br /&gt;16x8i: 40 40 00 00  40 40 00 00  40 40 00 00  3f 80 00 00&lt;br /&gt;4x32f: 3.0 3.0 3.0 1.0&lt;br /&gt;&lt;/blockquote&gt;The memory for cc contains this byte string:&lt;br /&gt;&lt;blockquote&gt;00 00 80 3f  00 00 40 40  00 00 40 40  00 00 40 40&lt;br /&gt;&lt;/blockquote&gt;I spent about 15 minutes trying to understand what the hell I was looking at, but then took a step back: if a tree falls in a forest and no one can see the trunk without writing it out to memory, who cares?  I wrote some code to do unpacks and low-high moves and sure enough, completely consistent behavior. 
 If you treat an __m128 as an array of four floats, unpack_lo_ps(a,b), for example, gives you { a[0], b[0], a[1], b[1] }.&lt;br /&gt;&lt;br /&gt;So what have I learned?  Well, if you look at an Intel SSE diagram like &lt;a href="http://www.jaist.ac.jp/iscenter-new/mpc/altix/altixdata/opt/intel/vtune/doc/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc320.htm"&gt;this&lt;/a&gt;, my conclusion is: component 0 is the same as the low bits in memory, which is the same as the first item of an array.  The fact that it is drawn on the right side of the diagram is an artifact of our left-to-right way of writing place-value numbers.  (I can only speculate that Intel's Israeli design team must find these diagrams even more byzantine.)&lt;br /&gt;&lt;br /&gt;* This is because until now in the X-Plane 10 development cycle, we haven't needed it - X-Plane 10 is the first build to do a fair amount of "uniform" transform on the CPU.  If anything that's a step back, because we really should be doing that kind of thing on the GPU.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-8954998856368035222?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/8954998856368035222/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/seriosly-strange-execution.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8954998856368035222'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8954998856368035222'/><link rel='alternate' type='text/html' 
href='http://hacksoflife.blogspot.com/2011/05/seriosly-strange-execution.html' title='Seriosly Strange Execution?'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-759197388595617623</id><published>2011-05-11T01:01:00.003-04:00</published><updated>2011-05-11T01:06:45.306-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Macintosh'/><title type='text'>SvnX on OS X 10.6?  You Need a Key Pair</title><content type='html'>A few members of our art team use a mix of the command line and SvnX to move art asset packs around via SVN.&lt;br /&gt;&lt;br /&gt;One minor hitch: SvnX can't log into a server that uses svn+ssh as its access method if ssh requires a manually typed password.&lt;br /&gt;&lt;br /&gt;The work-around is to establish a private/public key pair for ssh.  Once you do that, keychain will offer to store the password, and SvnX can function normally.&lt;br /&gt;&lt;br /&gt;In theory sshkeychain should let the key chain remember plain passwords, but I couldn't get this to work on 10.6.&lt;br /&gt;&lt;br /&gt;The keypair can be established as follows:&lt;br /&gt;&lt;br /&gt;&lt;code&gt;cd ~/.ssh&lt;br /&gt;ssh-keygen -t rsa&lt;br /&gt;(type desired password, accept default file name)&lt;br /&gt;scp id_rsa.pub you@server.com:/home/you/.ssh/authorized_keys&lt;br /&gt;(where "you" is your unix login name.  
authorized_keys may need a different name for different servers.)&lt;/code&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-759197388595617623?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/759197388595617623/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/svnx-on-os-x-106-you-need-key-pair.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/759197388595617623'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/759197388595617623'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/svnx-on-os-x-106-you-need-key-pair.html' title='SvnX on OS X 10.6?  
You Need a Key Pair'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-2195011531326008044</id><published>2011-05-07T14:05:00.004-04:00</published><updated>2011-05-07T14:23:19.952-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='GLSL'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>The Limits of 8-bit Normal Maps</title><content type='html'>It's safe to say that when one of the commenters points out &lt;a href="http://hacksoflife.blogspot.com/2010/12/yet-another-this-is-our-gbuffer-format.html"&gt;something that will go wrong&lt;/a&gt;, it's only a matter of time before &lt;a href="http://hacksoflife.blogspot.com/2011/02/g-buffer-normals-revisited.html"&gt;I find it myself&lt;/a&gt;.  In this case the issue was running out of normal map precision and it was a matter of having an art asset sensitive to normal maps.&lt;br /&gt;&lt;br /&gt;Well, our artists keep making newer and weirder art assets, and once again normal maps are problematic.  
In particular, when you really tighten up specular hilights, the angular precision per pixel of 8-bit normal maps makes it very difficult to create "small" effects without seeing the quantization of the map.&lt;br /&gt;&lt;br /&gt;I made this graph to illustrate the problem:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/-mhr6GS_XKFQ/TcWKimg0dBI/AAAAAAAAAt0/Pyts7UwlZzU/s1600/Picture%2B76.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 179px;" src="http://2.bp.blogspot.com/-mhr6GS_XKFQ/TcWKimg0dBI/AAAAAAAAAt0/Pyts7UwlZzU/s200/Picture%2B76.png" alt="" id="BLOGGER_PHOTO_ID_5604037638390838290" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;So what is this?  This equation (assuming I haven't screwed it up) shows the fall-off of specular light levels as a function of the "displacement" of the non-tangent channels of a normal map.  Or more literally, for every pixel of red you add to move the normal map off to the right, how much less bright does the normal become?&lt;br /&gt;&lt;br /&gt;In this case, our light source is hitting our surface dead on, we're in 8-bit, and I've ignored &lt;a href="http://hacksoflife.blogspot.com/2010/11/gamma-and-lighting-part-2-working-in.html"&gt;linear lighting&lt;/a&gt; (which would make the problems here worse in some cases, better in others).  I've also ignored having specularity being "cranked" to HDR levels - since we do this in X-Plane the effects are probably 2x to 3x worse than shown. Units to the right is added dx and dy vectors, and each unit vertically is a loss of brightness value.&lt;br /&gt;&lt;br /&gt;Three fall-off curves are shown based on exponents ^128, ^1024, and ^4096.  (The steepest and thus most sensitive one is ^4096).  
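&lt;br /&gt;&lt;br /&gt;A quick way to sanity-check curves like these is to tabulate them.  The Python below is only a sketch under assumptions that are mine, not taken from the graph: brightness_loss is a hypothetical helper, one unit of normal map offset is 1/255, only one channel is offset, the light is dead on so N dot L is sqrt(1 - d*d), and brightness is counted in 255 steps with no gamma and no HDR scaling.&lt;br /&gt;&lt;pre&gt;
```python
import math


def brightness_loss(offset_units, exponent, levels=255):
    """Brightness steps lost when the normal is pushed off-axis.

    Assumes one offset unit is 1/levels and the light is dead on,
    so N dot L is just the remaining z component of the normal.
    """
    d = offset_units / levels            # offset in the red (or green) channel
    n_dot_l = math.sqrt(1.0 - d * d)     # z of a unit normal with that offset
    return levels * (1.0 - n_dot_l ** exponent)


if __name__ == "__main__":
    # One row per specular exponent, one column per unit of offset (0..5).
    for exponent in (128, 1024, 4096):
        row = [round(brightness_loss(k, exponent), 1) for k in range(6)]
        print("^%d:" % exponent, row)
```
&lt;/pre&gt;With those assumptions the ^1024 curve comes out at roughly 2 units of brightness lost for the first unit of offset; a different encoding of the offset would shift the numbers.&lt;br /&gt;&lt;br /&gt;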
You can think of your specular exponent as an "amplifier" that "zooms in" on the very top of the lighting curve, and thus amplifies errors.&lt;br /&gt;&lt;br /&gt;So to read this: for the first minimal unit of offset we add to the normal map, we lose about two minimal units of brightness.  In other words, even at the top of the curve, with an exponent of ^1024, specular hilights will have a "quantized" look, and a smooth ramp of color is not possible.  It gets a lot worse - add that second unit of offset to the normal map and we lose eight units of color!&lt;br /&gt;&lt;br /&gt;(By comparison, the more gentle ^128 specular hilight isn't as bad - we lose six units of brightness for five of offset, so subtle normal maps might not look too chewed up.)&lt;br /&gt;&lt;br /&gt;This configuration could be worse - at least we have higher precision near zero offset.  With tangent space normal maps, large areas of near constant normals tend to be not perturbed very much - because if there is a sustained area of the perceived surface being "perpendicular" to the actual 3-d mesh, the author probably should have built the feature in 3-d.  (At least, this is true for an engine like X-Plane that doesn't have displacement mapping.)&lt;br /&gt;&lt;br /&gt;What can we do?&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Use some form of &lt;a href="http://sebh-blog.blogspot.com/2010/08/cryteks-best-fit-normals.html"&gt;normal map compression&lt;/a&gt; that takes advantage of the full bit-space of RGB.&lt;/li&gt;&lt;li&gt;Throw more bits at the problem, e.g. use RG16.  (This isn't much fun if you're using the A and B channels for other effects that only need 8 bits.)&lt;/li&gt;&lt;li&gt;Use the blue channel as an exponent (effectively turning the normal map into some kind of freaky 8.8 floating point).  This is something we're looking at now, so I'll have to post back as to whether it helps.  
The idea is that we can "recycle" the dynamic range of the RG channels when close to dark using the blue channel as a scalar.  This does not provide good normal accuracy for highly perturbed normals; the assumption above is that really good precision is needed most with the least offset.&lt;/li&gt;&lt;li&gt;Put some kind of global gamma curve on the RG channels.  This would intentionally make highly perturbed normals worse to get better results at low perturbations. (I think we're unlikely to productize this, but it might in some cases provide better results using only 16 bits.)&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Tell our artists "don't do that".  (They never like hearing that answer.)&lt;/li&gt;&lt;/ul&gt;I'll try to post some pictures once I am further along with this.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-2195011531326008044?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/2195011531326008044/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/limits-of-8-bit-normal-maps.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/2195011531326008044'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/2195011531326008044'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/limits-of-8-bit-normal-maps.html' title='The Limits of 8-bit Normal Maps'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-mhr6GS_XKFQ/TcWKimg0dBI/AAAAAAAAAt0/Pyts7UwlZzU/s72-c/Picture%2B76.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-5619546705133506835</id><published>2011-05-01T20:37:00.002-04:00</published><updated>2011-05-01T20:48:47.271-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Performance'/><category scheme='http://www.blogger.com/atom/ns#' term='Memory Management'/><title type='text'>Damn You, L2 Cache!!!</title><content type='html'>So first: &lt;a href="http://www.akkadia.org/drepper/cpumemory.pdf"&gt;this&lt;/a&gt; is a good read.  Having spent the weekend reading about how important it is not to miss cache and being reminded that having your structs fit in cache lines makes you bad-ass, I was all prepared to score an epic win against the forces of &lt;a href="http://hacksoflife.blogspot.com/2006/12/garbage-collection-memory-management-by.html"&gt;garbage collection&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Let me take a step back.&lt;br /&gt;&lt;br /&gt;X-Plane uses a series of custom allocation strategies in places where we know things that the system allocator cannot know (e.g. "all of these blocks have the same life-span", or "these allocations don't have to be thread-safe"), and this got us a win in terms of less CPU time being spent allocating.&lt;br /&gt;&lt;br /&gt;X-Plane also uses a quad-tree-like structure to cull our scene-graph.  The cull operation is very fast, and so (not surprisingly) when you profile the scene graph, the 'hot spots' in the quad tree are all L2 cache misses.  (You can try this on X-Plane 9, just turn down objects to clear out driver time and see in-sim work.)  
In other words, the limiting factor on plowing through the scene graph is not CPU processing, rather it's keeping the CPU fed with more quad-tree nodes from memory.&lt;br /&gt;&lt;br /&gt;The nodes in the quad tree come from one of these custom allocation strategies.&lt;br /&gt;&lt;br /&gt;So my clever plan was: modify the custom allocator to try to keep quad-tree nodes together in memory, improving locality, improving cache hits, improving framerate, and proving once again that all of that misery I go through managing my own memory (and chasing down memory scribbles of my own creation) is &lt;span style="font-style: italic;"&gt;totally&lt;/span&gt; worth it!&lt;br /&gt;&lt;br /&gt;Unfortunately, my clever plan made things worse.&lt;br /&gt;&lt;br /&gt;It turns out that the allocation pattern I had before was actually better than the one I very carefully planned out.  The central problem with most parts of X-Plane's scene graph is: you don't know what "stuff" is going to come out of the user's installed custom scenery, and you don't know precisely what will and won't be drawn.  Thus while there is some optimal way to set up the scene graph, you can't precompute it, and you can only come close with heuristics.&lt;br /&gt;&lt;br /&gt;In this case the heuristic I had before (allocation order will be similar to draw order) turns out to be surprisingly good, and the allocation order I replaced it with (keep small bits of the scene graph separate so they can remain local within themselves later) was worse.&lt;br /&gt;&lt;br /&gt;So...until next time, L2 cache, just know that somewhere, deep in my underground lair, I will be plotting to stuff you with good data.  
(Until then, I may have to go stuff myself with scotch.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-5619546705133506835?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/5619546705133506835/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/damn-you-l2-cache.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5619546705133506835'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5619546705133506835'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/05/damn-you-l2-cache.html' title='Damn You, L2 Cache!!!'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-7474705995349901816</id><published>2011-04-25T08:13:00.003-04:00</published><updated>2011-04-25T08:35:07.923-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Performance'/><title type='text'>Going to California (with an Aching in My Heart)</title><content type='html'>Periodically people will try to sum up relative latencies for hardware, but I really like &lt;a href="http://duartes.org/gustavo/blog/post/what-your-computer-does-while-you-wait"&gt;this article&lt;/a&gt;.  
In particular, putting memory distance in human terms helps give you a sense of the metaphorical groan your CPU must make every time it misses a cache.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;L1 cache: it's on your desk, pick it up.&lt;/li&gt;&lt;li&gt;L2 cache: it's on the bookshelf in your office, get up out of the chair.&lt;/li&gt;&lt;li&gt;Main memory: it's on the shelf in your garage downstairs, might as well get a snack while you're down there.&lt;/li&gt;&lt;li&gt;Disk: it's in, um, California.  Walk there.  Walk back.  Really.*&lt;/li&gt;&lt;/ul&gt;I had a pretty good idea that L2 misses were bad - when we profile X-Plane, some of the bottlenecks have tight correlation between L2 cache misses and total-time spent.  And I knew disks were slow, but...not that slow.&lt;br /&gt;&lt;br /&gt;If anything, that's a testament to how good the operating systems are at &lt;a href="http://duartes.org/gustavo/blog/post/page-cache-the-affair-between-memory-and-files"&gt;hiding the disk drive from us&lt;/a&gt; most of the time.&lt;br /&gt;&lt;br /&gt;The moral of the story: the disk can look a lot faster than it is, but only if you let it.  Unfortunately, there is one aspect of X-Plane that fails miserably at this: the use of a gajillion tiny text files for scenery packages.  The solution is simple: &lt;a href="http://wiki.x-plane.com/DSF_Art_Asset_Storage_RFC"&gt;pack the files into one bigger file&lt;/a&gt;.  This will let the OS pick up the (hopefully consecutive) single larger file and dump significant amounts of it into the page cache in one swoop without doing a million seeks. California is far away.&lt;br /&gt;&lt;br /&gt;* The author's metaphor maps one cycle to one human second.  That's the equivalent of about 474 days for a 3 GHz CPU to wait out a disk seek of roughly 41 million cycles (about 13.7 ms).  You'd have to put up better than 12 miles a day to make it to California and back from the East coast.  
If you actually live out west, um, pretend you're an SSD.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-7474705995349901816?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/7474705995349901816/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/04/going-to-california-with-aching-in-my.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7474705995349901816'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7474705995349901816'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/04/going-to-california-with-aching-in-my.html' title='Going to California (with an Aching in My Heart)'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-3601782664667833559</id><published>2011-04-22T16:01:00.003-04:00</published><updated>2011-04-26T10:56:26.206-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='GLSL'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>So Many AA Techniques, So Little Time</title><content type='html'>This is a short summary of FSAA techniques, both for the art team, and so I don't forget what I've read when I come back to this in 9 months.  
(No promise on accuracy here, these are short summaries, often with a bit of hand-waving, and some of the newer post-processing techniques are only out in paper form now.)&lt;br /&gt;&lt;br /&gt;Where does aliasing come from? It comes from decisions that are made "per-pixel", in particular (1) whether a pixel is inside or outside a triangle and (2) whether a pixel meets or fails the alpha test.&lt;br /&gt;&lt;br /&gt;Texture filtering will not alias if the texture is mip-mapped; since the texel is pulled out by going "back" from a screen pixel to the texture, as long as we have mip-mapping, we get smooth linear interpolation.  (See Texture AA below.)&lt;br /&gt;&lt;h3&gt;Universal Techniques&lt;/h3&gt;&lt;span style="font-weight: bold;"&gt;Super-Sampled Anti-Aliasing (&lt;a href="http://en.wikipedia.org/wiki/Supersampling"&gt;SSAA&lt;/a&gt;).&lt;/span&gt;  The oldest trick in the book - I list it as universal because you can use it pretty much anywhere: forward or deferred rendering, it also anti-aliases alpha cutouts, and it gives you better texture sampling at high anisotropy too.  Basically, you render the image at a higher resolution and down-sample with a filter when done.  Sharp edges become anti-aliased as they are down-sized.&lt;br /&gt;&lt;br /&gt;Of course, there's a reason why people don't use SSAA: it costs a fortune.  Whatever your fill rate bill, it's 4x for even minimal SSAA.&lt;br /&gt;&lt;h3&gt;Hardware FSAA Techniques&lt;/h3&gt;These techniques cover the entire frame-buffer and are implemented in hardware.  You just ask the driver for them and go home happy - easy!&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Multi-Sampled Anti-Aliasing (&lt;a href="http://en.wikipedia.org/wiki/Multisample_anti-aliasing"&gt;MSAA&lt;/a&gt;).&lt;/span&gt;  This is what you typically have in hardware on a modern graphics card.  
The graphics card renders to a surface that is larger than the final image, but in shading each "cluster" of samples (that will end up in a single pixel on the final screen) the pixel shader is run only once.  We save a ton of fill rate, but we still burn memory bandwidth.&lt;br /&gt;&lt;br /&gt;This technique does not anti-alias any effects coming out of the shader, because the shader runs at 1x, so alpha cutouts are jagged.  This is the most common way to run a forward-rendering game.  MSAA does not work for a deferred renderer because lighting decisions are made &lt;span style="font-style: italic;"&gt;after&lt;/span&gt; the MSAA is "resolved" (down-sized) to its final image size.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Coverage Sample Anti-Aliasing (&lt;a href="ftp://download.nvidia.com/developer/SDK/Individual_Samples/DEMOS/Direct3D9/src/CSAATutorial/docs/CSAATutorial.pdf"&gt;CSAA&lt;/a&gt;).&lt;/span&gt;  A further optimization on MSAA from NVidia.  Besides running the shader at 1x and the framebuffer at 4x, the GPU's rasterizer is run at 16x.  So while the depth buffer produces better anti-aliasing, the intermediate shades of blending produced are even better.&lt;br /&gt;&lt;h3&gt;2-d Techniques&lt;/h3&gt;The above techniques can be thought of as "3-d" because (1) they all play nicely with the depth buffer, allowing hidden surface removal and (2) they all run &lt;span style="font-style: italic;"&gt;during &lt;/span&gt;rasterization, so the smoothing is correctly done between different parts of a 3-d model.  But if we don't need the depth buffer to work, we have other options.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://glprogramming.com/red/chapter06.html#name2"&gt;&lt;span style="font-weight: bold;"&gt;Antialiased Primitives&lt;/span&gt;&lt;/a&gt;.  You can ask OpenGL to anti-alias your primitives as you draw them; the only problem is that it doesn't work.  
Real anti-aliased primitives aren't required by the spec, and modern hardware doesn't support them.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://homepage.mac.com/arekkusu/bugs/invariance/TexAA.html"&gt;&lt;span style="font-weight: bold;"&gt;Texture Anti-Aliasing&lt;/span&gt;&lt;/a&gt;.  You can create the appearance of an anti-aliased edge by using a textured quad and buffering your texture with at least one pixel of transparent alpha.  The sampling back into your texture from the screen is done at sub-pixel resolution and is blended bilinearly; the result will be that the 'apparent' edge of your rendering (e.g. where inside your quad the opaque -&amp;gt; alpha edge appears) will look anti-aliased.  Note that you must be alpha blending, not alpha testing.&lt;br /&gt;&lt;br /&gt;If you're working in 2-d I strongly recommend this technique; this is how a lot of X-Plane's instruments work.  It's cheap, it's fast, the anti-aliasing is the highest quality you'll see, and it works on all hardware.  Of course, the limitation is that this isn't compatible with the Z buffer.  If you haven't designed for this solution, a retrofit could be expensive.&lt;br /&gt;&lt;h3&gt;Post-Processing Techniques&lt;/h3&gt;There are a few techniques that attempt to fix aliasing as a post-processing step.  These techniques don't depend on what was drawn - they just "work".  The disadvantages of these techniques are the processing time to run the filter itself (e.g. 
they can be quite complex and expensive) and (because they don't use any of the real primitive rendering information) the anti-aliasing can be a bit of a loose cannon.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Morphological Anti-Aliasing (&lt;a href="http://www.realtimerendering.com/blog/morphological-antialiasing/"&gt;MLAA&lt;/a&gt;)&lt;/span&gt; and &lt;span style="font-weight: bold;"&gt;Fast Approximate Anti-Aliasing (&lt;a href="http://timothylottes.blogspot.com/2011/03/nvidia-fxaa.html"&gt;FXAA&lt;/a&gt;).&lt;/span&gt;  These techniques analyze the image after rendering and attempt to identify and blur out stair-stepped patterns.  ATI is providing an MLAA post-process as a driver option, which is interesting because it moves us back to the traditional game ecosystem where full screen anti-aliasing just works without developer input.&lt;br /&gt;&lt;br /&gt;Edit: See also Directionally Localized Anti-Aliasing (&lt;a href="http://and.intercon.ru/releases/talks/dlaagdc2011/"&gt;DLAA&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;(From a hardware standpoint, full screen anti-aliasing burns GPU cycles and sells more expensive cards, so ATI and NVidia want gamers to have the option of FSAA.  But most new games are deferred now, making MSAA useless.  By putting MLAA in the driver, ATI gets back to burning GPU cycles to improve quality, even if individual game developers don't write their own post-processing shader.)&lt;br /&gt;&lt;br /&gt;It is not clear to me what the difference is between MLAA and FXAA - I haven't taken the time to look at both algorithms in detail.  They appear to be similar in general approach at least.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Temporal Anti-Aliasing (&lt;a href="http://en.wikipedia.org/wiki/Temporal_anti-aliasing"&gt;TAA&lt;/a&gt;).&lt;/span&gt;  This is a post-process filter that blends the frame with the previous frame.  Rather than have more samples on the screen (e.g. 
a 2x bigger screen in all dimensions for SSAA) we use the past frame as a second set of samples.  The camera is moved less than one pixel between frames to ensure that we get different samples between frames.  When blending pixels, we look for major movement and try to avoid blending with a sample that wasn't based on the same object.  (In other words, if the camera moves quickly, we don't want ghosting.)&lt;br /&gt;&lt;h3&gt;Deferred-Rendering Techniques&lt;/h3&gt;These techniques are post-processing filters that specifically use the 3-d information saved in the G-Buffer of a deferred renderer.  The idea is that with a G-Buffer we can do a better job of deciding when to resample/blur.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Edge Detection and Blur.&lt;/span&gt;  These techniques locate the edges of polygons by looking for discontinuities in the depth or normal vector of a scene, and then blur those pixels a bit to soften jaggies.  This is one of the older techniques for anti-aliasing a deferred renderer - I first read about it in &lt;a href="http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter09.html"&gt;GPU Gems 2&lt;/a&gt;.  The main advantage is that this technique is dirt cheap.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Sub-pixel Reconstruction Anti-Aliasing (&lt;a href="http://research.nvidia.com/publication/subpixel-reconstruction-antialiasing"&gt;SRAA&lt;/a&gt;).&lt;/span&gt; This new technique (published by NVidia) uses an MSAA G-Buffer to reconstruct coverage information.  
The G-Buffer is MSAA; you resolve it and then do a deferred pass at 1x (saving lighting) but then go back to the original 4x MSAA G-Buffer to edge detect.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-3601782664667833559?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/3601782664667833559/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/04/so-many-aa-techniques-so-little-time.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3601782664667833559'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3601782664667833559'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/04/so-many-aa-techniques-so-little-time.html' title='So Many AA Techniques, So Little Time'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-4817839031571180026</id><published>2011-04-22T11:22:00.003-04:00</published><updated>2011-04-22T11:32:58.096-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Threading'/><category scheme='http://www.blogger.com/atom/ns#' term='Quotes'/><title type='text'>I Love Surprises</title><content type='html'>Awesome quote from the &lt;a href="http://java.sun.com/docs/books/jls/second_edition/html/memory.doc.html"&gt;Java Language Spec&lt;/a&gt;:&lt;br /&gt;&lt;blockquote&gt;In 
the absence of explicit synchronization, an implementation is free to  update the main memory in an order that may be surprising. Therefore  the programmer who prefers to avoid surprises should use explicit  synchronization.&lt;br /&gt;&lt;/blockquote&gt;As you know, the &lt;a href="http://www.youtube.com/watch?v=cmCKJi3CKGE"&gt;premier loves surprises&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-4817839031571180026?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/4817839031571180026/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/04/i-love-surprises.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/4817839031571180026'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/4817839031571180026'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/04/i-love-surprises.html' title='I Love Surprises'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-3293666553936613888</id><published>2011-03-10T08:10:00.002-05:00</published><updated>2011-03-10T08:11:13.793-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='c++'/><title type='text'>DECAFBAD</title><content type='html'>Makes you write code like this...&lt;br /&gt;&lt;blockquote&gt;if ((ent = 
dynamic_cast&amp;lt;IGISComposite*&amp;gt;(what)) &amp;amp;&amp;amp; ent-&amp;gt;GetGISClass() == gis_Composite) true;&lt;br /&gt;&lt;/blockquote&gt;At least C++ isn't judgmental.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-3293666553936613888?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/3293666553936613888/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/03/decafbad.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3293666553936613888'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3293666553936613888'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/03/decafbad.html' title='DECAFBAD'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-402639303826770852</id><published>2011-03-07T11:01:00.002-05:00</published><updated>2011-03-08T21:57:29.637-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>Instancing Numbers</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/-hhhKCrVhfG8/TXUBm5BLPbI/AAAAAAAAAts/aPSUnAN_WWc/s1600/X-Plane%2Bscreenshot_c4_204.png"&gt;&lt;img style="cursor: pointer; width: 339px; height: 254px;" 
src="http://2.bp.blogspot.com/-hhhKCrVhfG8/TXUBm5BLPbI/AAAAAAAAAts/aPSUnAN_WWc/s200/X-Plane%2Bscreenshot_c4_204.png" alt="" id="BLOGGER_PHOTO_ID_5581369080848006578" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;A quick stat on instancing performance.  There are a lot of OpenGL posts with developers posting their instancing performance numbers, and others asking, so here's X-Plane.&lt;br /&gt;&lt;br /&gt;On a 2.8 ghz Mac Pro (a few years old) with an ATI 4870 and OS X 10.6.6, we can push 87,000 meshes at just under 60 fps using instancing.  The average instance call is pushing 32 instances per draw call.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-402639303826770852?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/402639303826770852/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/03/instancing-numbers.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/402639303826770852'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/402639303826770852'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/03/instancing-numbers.html' title='Instancing Numbers'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-hhhKCrVhfG8/TXUBm5BLPbI/AAAAAAAAAts/aPSUnAN_WWc/s72-c/X-Plane%2Bscreenshot_c4_204.png' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-3854246837182553032</id><published>2011-03-07T09:20:00.003-05:00</published><updated>2011-03-07T09:49:32.887-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Threading'/><category scheme='http://www.blogger.com/atom/ns#' term='Debugging'/><category scheme='http://www.blogger.com/atom/ns#' term='GDB'/><title type='text'>Don't Go Anywhere!</title><content type='html'>I'm debugging X-Plane's autogen engine.  In debug mode, with no inlining, optimizations, and a pile of safety checks, the autogen engine is not very fast.  Fortunately, my main development machine has 8 cores, and the autogen engine is completely thread-crazy.  The work gets spooled out to a worker pool and goes...well, about 8 times as fast.&lt;br /&gt;&lt;br /&gt;All is good and I'm sipping my coffee when I hit a break-point.  Hrm...looks like we have a NaN.  Well, we divided by a sum of some elements of a vector.  What's in the vector?&lt;br /&gt;&lt;blockquote&gt;print ag_block.spellings_s.[0].widths[1]&lt;br /&gt;&lt;/blockquote&gt;Ah...8 tiles.  At this point I am already dead.  
If you've debugged threaded apps you already know what went wrong:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The array access operator in vector is really a function call (particularly in debug mode - we jam bounds checks in there).&lt;/li&gt;&lt;li&gt;GDB has to let the application 'run' to run the array operator, and at that instant, the sim's thread can switch.&lt;/li&gt;&lt;li&gt;The new thread will run until it hits some kind of break-point.&lt;/li&gt;&lt;li&gt;If you have 8 threads running the same operation, you will hit the break point you expect...but from the &lt;span style="font-style: italic;"&gt;wrong&lt;/span&gt; thread.&lt;/li&gt;&lt;/ul&gt;To say this makes debugging a bit confusing is an understatement.&lt;br /&gt;&lt;br /&gt;A brute force solution is to turn off threading - in X-Plane you can simply tell the sim that your machine has one core using the command line.  But that means slow load times.&lt;br /&gt;&lt;br /&gt;Fortunately gdb has these clever commands:&lt;br /&gt;&lt;blockquote&gt;set scheduler-locking on&lt;br /&gt;set scheduler-locking off&lt;br /&gt;&lt;/blockquote&gt;When you set scheduler locking on, the thread scheduler can't jump threads.  This is handy before an extended inspection session with STL classes.  
You can apparently put the scheduler into 'step' mode, which will switch on run but not on step, but I haven't needed that yet.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-3854246837182553032?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/3854246837182553032/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/03/dont-go-anywhere.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3854246837182553032'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3854246837182553032'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/03/dont-go-anywhere.html' title='Don&apos;t Go Anywhere!'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-6027812747114186763</id><published>2011-03-06T18:41:00.004-05:00</published><updated>2011-03-06T18:44:53.643-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Quotes'/><category scheme='http://www.blogger.com/atom/ns#' term='NVidia'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>CSM for Dummies</title><content type='html'>This quote from NVidia's &lt;a href="http://developer.download.nvidia.com/GPU_Programming_Guide/GPU_Programming_Guide_G80.pdf"&gt;GPU Programming Guide&lt;/a&gt; amused me:&lt;br 
/&gt;&lt;blockquote&gt;There are many techniques available.  However, the general recommendation is&lt;br /&gt;that unless you know what you are doing you should just implement simple&lt;br /&gt;multi-tap cascaded shadow maps.&lt;br /&gt;&lt;/blockquote&gt;Or put another way:&lt;br /&gt;&lt;blockquote&gt;If you have no idea what the hell you're doing, try cascaded shadow maps -- what could go wrong?&lt;br /&gt;&lt;/blockquote&gt;Oh wait, X-Plane 10 uses CSM.  Well, I guess that's for the best...&lt;br /&gt;&lt;br /&gt;(The guide also suggests that "3 levels are sufficient to provide good shadow detail for any scene."  Have they &lt;span style="font-style: italic;"&gt;seen&lt;/span&gt; our scene graph?)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-6027812747114186763?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/6027812747114186763/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/03/csm-for-dummies.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/6027812747114186763'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/6027812747114186763'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/03/csm-for-dummies.html' title='CSM for Dummies'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-5947824673470885433</id><published>2011-02-28T09:34:00.002-05:00</published><updated>2011-02-28T09:49:10.177-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Computational Geometry'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>Order-Correct Translucency</title><content type='html'>When ATI released their &lt;a href="http://developer.amd.com/samples/demos/pages/atiradeonhd5800seriesrealtimedemos.aspx"&gt;order independent transparency demo&lt;/a&gt;, I nearly wet myself.  Translucency has been the bane of X-Plane authors for years.  The problem is that translucent surfaces remove hidden surfaces behind them, leading to artifacts.  The thought of on-hardware OIT was tantalizing.&lt;br /&gt;&lt;br /&gt;That is, until I found out how the tech works.  My understanding is that OIT is implemented by "writing your own back-end" - that is, instead of shading into a framebuffer, you write fragments into a 'deep' framebuffer by hand, using compute-shader-style ops to create linked lists of fragments.  (That is, fragments live in a general store and the framebuffer is really list heads.)  In a post processing pass, you go through the 'buckets' (that is, the linked lists) and sort out what you drew.&lt;br /&gt;&lt;br /&gt;That's a lot more back-end than I wanted...as a [spoiled, lazy] app developer I was hoping for glEnable(GL_MAGIC_OIT_EXT) - but no such luck.  The real issue is that, since our product already does a lot of 'back end' tricks within OpenGL, the cost of getting our shaders to run in a compute-style environment might be a bit high.  
(This is looking less burdensome with some of the newer extensions, but it still seems to me that it would be difficult to port legacy apps to OIT-style rendering without having compute-shader features like atomic counters inside the GLSL shading environment.)&lt;br /&gt;&lt;br /&gt;As a side note, I also looked closely at depth peeling and even hacking the blend equation (e.g. accumulate and average) and both would probably be feasible for X-Plane, which tends not to have &lt;span style="font-style: italic;"&gt;that&lt;/span&gt; much translucent overlap - the most common case for us is windows.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The Traditional Approach - Automated&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Now the traditional approach to translucency in X-Plane goes something like this:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Force opaque drawing first.&lt;/li&gt;&lt;li&gt;Use one-sided drawing and order the translucent polygons so they appear from back to front from any viewpoint.&lt;/li&gt;&lt;/ul&gt;That second point is key: consider an airplane with windows.  If we draw the interior-facing windows first and the exterior-facing windows second, then from &lt;span style="font-style: italic;"&gt;any&lt;/span&gt; viewpoint, we are drawing 'back to front'.  This works because whenever we see two windows at once, we are seeing the inside window behind the outside one.  Isn't topology grand?&lt;br /&gt;&lt;br /&gt;Well, it turns out that this approach can be generalized: as long as none of our triangles intersect (except at their edges and corners), given any two triangles, we can always find a draw order between them that is correct.  Given a set of triangles, we can always sort the whole mesh to be appropriately back-to-front.  (At least, that's my theory until someone proves me wrong.)&lt;br /&gt;&lt;br /&gt;There are basically three cases:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Triangle B is fully on one side or the other of triangle A's plane.  
B should clearly be drawn before or after A depending on which side it's on.&lt;/li&gt;&lt;li&gt;Triangle A is fully on one side or the other of triangle B's plane.  A should clearly be drawn before or after B depending on which side it's on.&lt;/li&gt;&lt;li&gt;Triangle B and A are both on one side of each other's plane; we can use either triangle to determine correct order - they will not conflict.  (That is, this is a disjoint case, and either the two triangles are going to give you the same answer or they're facing in opposite directions and thus not visible at the same time.)&lt;/li&gt;&lt;/ol&gt;The fourth case would be two intersecting triangles - that's the case we can't necessarily get right.&lt;br /&gt;&lt;br /&gt;The ability to find this sort order depends on using one-sided triangles - this is what lets us decouple the sort order for two opposite directions.  By definition if a triangle is visible to a vector V, its back side is visible to -V.&lt;br /&gt;&lt;br /&gt;This approach of course doesn't solve all problems:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Animation can deform the mesh in a way that violates our correct order.&lt;/li&gt;&lt;li&gt;Multiple unrelated objects still need a relative ordering that makes sense.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Theoretical Angst&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Just a touch of angst...I'm no theoretician, and I can't help but wonder if there is a screwy case that this doesn't handle.  
In particular, the sort order needs to be a strict weak ordering or we're going to get goofy results, and I'm not entirely sure that it is.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-5947824673470885433?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/5947824673470885433/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/02/order-correct-translucency.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5947824673470885433'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5947824673470885433'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/02/order-correct-translucency.html' title='Order-Correct Translucency'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-2449940063624213234</id><published>2011-02-19T11:09:00.004-05:00</published><updated>2011-02-19T11:11:45.387-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='c++'/><category scheme='http://www.blogger.com/atom/ns#' term='Debugging'/><title type='text'>Mmm....C0FFEE.</title><content type='html'>I must just be late to the party, but: I just realized (approximately a decade later than I should) that C0FFEE can be spelled in hex.  
How have I never seen a code base use this as a 'token word' (albeit with some high-bit junk)?  Actually you'd want to pad the low bits to make it odd too.&lt;br /&gt;&lt;br /&gt;The most common cookies I've seen in production code are 0BADF00D, DEADBEEF, and FEEDFACE.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-2449940063624213234?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/2449940063624213234/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/02/mmmc0ffee.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/2449940063624213234'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/2449940063624213234'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/02/mmmc0ffee.html' title='Mmm....C0FFEE.'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-5092879069924311461</id><published>2011-02-15T15:22:00.003-05:00</published><updated>2011-02-15T15:29:33.929-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Wordpress'/><title type='text'>Permalink Shortcode</title><content type='html'>This must exist somewhere in WordPress or as part of a plugin, so if you know what plugin I should have used, feel free to heap on the abuse in the comments section.  
Anyway...&lt;br /&gt;&lt;br /&gt;I wanted a way to create a link to a WP page from inside a post or page that wouldn't need modification when the target page's parent was changed.  WordPress's permalink scheme uses a hierarchy of parent/child/grandchild/ to identify pages, and this can change as a page is reparented.  If the parenting scheme is meant to represent a navigational hierarchy, reparenting a page can leave you with dead links.&lt;br /&gt;&lt;br /&gt;I ended up with this function, loosely based on snippets I found on the web:&lt;br /&gt;&lt;code&gt;function permalink_func( $atts, $content=null ) {&lt;br /&gt;       extract( shortcode_atts( array(&lt;br /&gt;               'p' =&gt; '1',&lt;br /&gt;       ), $atts ) );&lt;br /&gt;       if ($content == null)&lt;br /&gt;               $content = get_the_title($p);&lt;br /&gt;       $link = get_permalink($p);&lt;br /&gt;     &lt;br /&gt;       return "&amp;lt;a href=\"$link\"&amp;gt;$content&amp;lt;/a&amp;gt;";&lt;br /&gt;}&lt;br /&gt;add_shortcode( 'prm', 'permalink_func' );&lt;/code&gt;&lt;br /&gt;(It lives in functions.php inside a php block.)&lt;br /&gt;&lt;br /&gt;The shortcode is used like this: [prm p=1] or [prm p=6]link title[/prm].  If the shortcode is used with no closing tag, the article's title is used to label the link.  The parameter (e.g. p=17) is the ID of the page, and can be seen by mousing over the page or post in the admin interface.  The URL generated by the shortcode matches the current permalink scheme.&lt;br /&gt;&lt;br /&gt;Once again, I am amazed by how easy it is to get things done with WordPress.  
It doesn't seem right...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-5092879069924311461?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/5092879069924311461/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/02/permalink-shortcode.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5092879069924311461'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5092879069924311461'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/02/permalink-shortcode.html' title='Permalink Shortcode'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-3369187572699203683</id><published>2011-02-10T09:31:00.002-05:00</published><updated>2011-02-10T09:38:25.729-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Wordpress'/><title type='text'>Random Wordpress Notes</title><content type='html'>We're converting our website to WordPress (which I continue to be impressed by, but that'll be another post).  One or two random notes.&lt;br /&gt;&lt;br /&gt;If you put your news feed 'on a page' the page template is ignored - index.php is still used.  I am sure this is by design, but I discovered it while creating a custom template.  
The page contents appear to be ignored too.&lt;br /&gt;&lt;br /&gt;If you have an existing site and you want to merge in WordPress, you can do this:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Host your news feed on a specific page, rather than letting it default to 'home'.&lt;/li&gt;&lt;li&gt;Change the WordPress URL base (not install base) to your site.&lt;/li&gt;&lt;li&gt;Put a mod_rewrite rule into your site root to rewrite missing files to /wp/index.php (or wherever WP is installed).&lt;/li&gt;&lt;li&gt;If you want to replace an existing HTML page with a WP page you can use a rewrite rule from the old name to something like /wp/index.php?page_id=20 (or whatever page ID you want).&lt;/li&gt;&lt;/ul&gt;This is similar to the normal 'changing base' rules for WP, except that you don't need to create a second index.php in your root folder - your old site's home page stays in place.&lt;br /&gt;&lt;blockquote&gt;RewriteEngine On&lt;br /&gt;RewriteRule news.html /wp/index.php?page_id=5 [L]&lt;br /&gt;RewriteCond %{REQUEST_FILENAME} !-f&lt;br /&gt;RewriteCond %{REQUEST_FILENAME} !-d&lt;br /&gt;RewriteRule . /wp/index.php [L]&lt;br /&gt;&lt;/blockquote&gt;mod_rewrite is pretty cryptic. 
Basically what this says is:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;If the user asks for news.html, go to WordPress article 5.&lt;/li&gt;&lt;li&gt;If the user asks for a missing file, let WordPress sort it out.&lt;/li&gt;&lt;/ol&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-3369187572699203683?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/3369187572699203683/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/02/random-wordpress-notes.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3369187572699203683'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3369187572699203683'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/02/random-wordpress-notes.html' title='Random Wordpress Notes'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-5033876471926323936</id><published>2011-02-04T18:51:00.007-05:00</published><updated>2011-02-04T20:32:33.651-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='GLSL'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>G-Buffer Normals, Revisited</title><content type='html'>A while ago I posted the &lt;a href="http://hacksoflife.blogspot.com/2010/12/yet-another-this-is-our-gbuffer-format.html"&gt;G-buffer 
format&lt;/a&gt; for X-Plane 10, which, as of this writing, is still in development. &lt;a href="http://sebh-blog.blogspot.com/"&gt;SebH&lt;/a&gt; brought up CryTek's &lt;a href="http://sebh-blog.blogspot.com/2010/08/cryteks-best-fit-normals.html"&gt;normal map compression&lt;/a&gt; and I hand-waved a bit and wondered to myself whether some kind of normal map goblin was going to pop up later in the development cycle.&lt;br /&gt;&lt;br /&gt;The short answer: yes.&lt;br /&gt;&lt;br /&gt;I will try to write up a post later describing the precision problems with normal maps in more detail, but for now I'll post the problem and its partial solution, while I still have the debug code in my shaders.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_TrRVoYy3Itc/TUySgVarhaI/AAAAAAAAAr8/SU3TqrT_b74/s1600/orig.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 150px;" src="http://3.bp.blogspot.com/_TrRVoYy3Itc/TUySgVarhaI/AAAAAAAAAr8/SU3TqrT_b74/s200/orig.png" alt="" id="BLOGGER_PHOTO_ID_5569987923352978850" border="0" /&gt;&lt;/a&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_TrRVoYy3Itc/TUySgtKWMaI/AAAAAAAAAsE/qSygnLAP-uQ/s1600/uncomp.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 150px;" src="http://3.bp.blogspot.com/_TrRVoYy3Itc/TUySgtKWMaI/AAAAAAAAAsE/qSygnLAP-uQ/s200/uncomp.png" alt="" id="BLOGGER_PHOTO_ID_5569987929726923170" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;This is a Baron 58 that Tom Kyler is working on for version 10.  He is probably &lt;i&gt;not&lt;/i&gt; very happy that I'm posting pictures of it, because it's still in progress, and while I think it looks pretty good, our art guys get a lot of, um, "artsy goodness" into the models in the last few passes.  
(If the lighting seems a little, um, bizarre, it probably is; lord knows what state of debug the sun shader was in when I took these pics.)&lt;br /&gt;&lt;br /&gt;The left image is the airplane, lit by an evening sun that has just barely set, directly behind us; the right image shows the fully reconstructed per-pixel eye-space normals.  The small icons show the rough contents of the four layers of our G-Buffer.&lt;br /&gt;&lt;br /&gt;So far things seem reasonably sane - the engine nacelle is lit from the side but not the top.  But here's where things go south:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_TrRVoYy3Itc/TUymNs0UURI/AAAAAAAAAs0/D9DOtkQqLnE/s1600/nrml_xz.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 150px;" src="http://1.bp.blogspot.com/_TrRVoYy3Itc/TUymNs0UURI/AAAAAAAAAs0/D9DOtkQqLnE/s200/nrml_xz.png" alt="" id="BLOGGER_PHOTO_ID_5570009593449566482" border="0" /&gt;&lt;/a&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_TrRVoYy3Itc/TUymN4pt89I/AAAAAAAAAs8/Y1q3AFCY0a4/s1600/nrml_xyz.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 150px;" src="http://3.bp.blogspot.com/_TrRVoYy3Itc/TUymN4pt89I/AAAAAAAAAs8/Y1q3AFCY0a4/s200/nrml_xyz.png" alt="" id="BLOGGER_PHOTO_ID_5570009596626334674" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;This is a wing with a light on the leading edge.  The surface normal of the wing is almost perpendicular to the light direction, which really stress-tests the quality of our normal vectors.  The first picture is the 'classic' G-Buffer technique: 16-bit-float dx and dy eye-space vectors, with Z reconstructed in the shader.  As you can see, it develops banding at the very low edge of angle-based attenuation.  (Note that this area would be super-dark if we weren't in linear space.)  
The second image shows the full XYZ normal (burning an extra G-Buffer 16-bit channel)...clearly this fixes the problem of reconstruction from low-precision sources, but channels are hard to come by in gbuffers.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_TrRVoYy3Itc/TUymOAk87sI/AAAAAAAAAtE/9elXpFcbAPc/s1600/nrml_lambert.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 150px;" src="http://3.bp.blogspot.com/_TrRVoYy3Itc/TUymOAk87sI/AAAAAAAAAtE/9elXpFcbAPc/s200/nrml_lambert.png" alt="" id="BLOGGER_PHOTO_ID_5570009598753828546" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Fortunately I found this totally awesome write-up of different &lt;a href="http://aras-p.info/texts/CompactNormalStorage.html"&gt;normal compression&lt;/a&gt; schemes.  The above picture on the right is a &lt;a href="http://en.wikipedia.org/wiki/Lambert_azimuthal_equal-area_projection"&gt;Lambert Azimuthal Equal-Area Projection&lt;/a&gt; using two channels.&lt;br /&gt;&lt;br /&gt;Here are a few more pics of the gbuffer normal map, both projected and expanded:&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_TrRVoYy3Itc/TUynszvebJI/AAAAAAAAAtc/p7qoE1wjhfE/s1600/wide_lambert_recon.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 150px;" src="http://4.bp.blogspot.com/_TrRVoYy3Itc/TUynszvebJI/AAAAAAAAAtc/p7qoE1wjhfE/s200/wide_lambert_recon.png" alt="" id="BLOGGER_PHOTO_ID_5570011227395878034" border="0" /&gt;&lt;/a&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_TrRVoYy3Itc/TUyntO5gBNI/AAAAAAAAAtk/sPrZgDhTFXQ/s1600/wide_lambert_comp.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 150px;" src="http://3.bp.blogspot.com/_TrRVoYy3Itc/TUyntO5gBNI/AAAAAAAAAtk/sPrZgDhTFXQ/s200/wide_lambert_comp.png" alt="" id="BLOGGER_PHOTO_ID_5570011234685682898" border="0" /&gt;&lt;/a&gt;&lt;br 
/&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_TrRVoYy3Itc/TUynsut6DaI/AAAAAAAAAtU/7yg2kCQtj0c/s1600/lambert_comp.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 150px;" src="http://3.bp.blogspot.com/_TrRVoYy3Itc/TUynsut6DaI/AAAAAAAAAtU/7yg2kCQtj0c/s200/lambert_comp.png" alt="" id="BLOGGER_PHOTO_ID_5570011226047122850" border="0" /&gt;&lt;/a&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_TrRVoYy3Itc/TUynsgB65SI/AAAAAAAAAtM/_heL-9pAR8k/s1600/lambert_recon.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 150px;" src="http://3.bp.blogspot.com/_TrRVoYy3Itc/TUynsgB65SI/AAAAAAAAAtM/_heL-9pAR8k/s200/lambert_recon.png" alt="" id="BLOGGER_PHOTO_ID_5570011222104532258" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;Side benefit: Lambertian projection copes with negative eye space Z (but not 0,0,-1, which is unlikely even with tangent space normal maps on art assets) so no more hand-waving there.&lt;br /&gt;&lt;br /&gt;One last thought for now: this entire post refers to the 'normal map' layer of a g-buffer, that is, the saved per-pixel normal information.  
Compression of 'normal map' textures for art assets is a bit of a different problem - the most immediate note is that they can be compressed off-line, so  non-realtime compression techniques are fair game.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-5033876471926323936?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/5033876471926323936/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/02/g-buffer-normals-revisited.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5033876471926323936'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5033876471926323936'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/02/g-buffer-normals-revisited.html' title='G-Buffer Normals, Revisited'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_TrRVoYy3Itc/TUySgVarhaI/AAAAAAAAAr8/SU3TqrT_b74/s72-c/orig.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-3142508526353292085</id><published>2011-02-02T17:07:00.000-05:00</published><updated>2011-02-02T17:07:01.203-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Android'/><title type='text'>Losing Javadocs in Eclipse: SOLUTION</title><content type='html'>Occasionally and for 
reasons that I do not fully understand, Eclipse may lose track of your Javadocs. That means that when you mouse over an Android API call expecting to read about it, you'll get the dreaded "This element has no attached source and the Javadoc could not be found in the attached Javadoc" error message.&lt;br /&gt;&lt;br /&gt;The traditional method for solving this issue is to delete the Eclipse .metadata directory. This does in fact work (I tried it) but it also requires you to redownload and set up the Android ADT...plus you lose ALL of your Eclipse settings and preferences. If you're like me and have custom fonts and syntax highlighting set up, then that's a nuisance.&lt;br /&gt;&lt;br /&gt;The "right way" (and by right way I mean "this worked for me and I didn't lose data") to solve this problem is to follow these instructions:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;In Eclipse, right click on your Android project and select Properties.&lt;/li&gt;&lt;li&gt;On the menu on the left, select "Java Build Path".&lt;/li&gt;&lt;li&gt;On the right hand side, select the "tab" labelled "Libraries".&lt;/li&gt;&lt;li&gt;Here you should see the Android SDK that you're targeting. For example: "Android 2.2".&lt;/li&gt;&lt;li&gt;Click on the arrow to the left of the Android SDK to expand the sublevels.&lt;/li&gt;&lt;li&gt;Find "Android.jar" and click on the arrow to the left of that one as well to expand it.&lt;/li&gt;&lt;li&gt;You'll see a setting called "Javadoc location". Select that and then click on the "Edit" button.&lt;/li&gt;&lt;li&gt;At the top, RESELECT the path to your javadocs. This is usually "path_to_android_sdk/android-sdk-mac_86/docs/reference/". I say RESELECT because even if it's right, you should browse and do it over anyway.&lt;/li&gt;&lt;li&gt;Click on "validate". 
You should be all set now!&lt;/li&gt;&lt;/ol&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-3142508526353292085?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/3142508526353292085/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/02/losing-javadocs-in-eclipse-solution.html#comment-form' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3142508526353292085'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3142508526353292085'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/02/losing-javadocs-in-eclipse-solution.html' title='Losing Javadocs in Eclipse: SOLUTION'/><author><name>Chris</name><uri>http://www.blogger.com/profile/14648675681957285299</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='26' src='http://www.cjserio.com/blogger/uploaded_images/Chris.jpg'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-3676024935025896350</id><published>2011-01-28T18:36:00.002-05:00</published><updated>2011-01-28T18:46:52.721-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='X-Plane'/><category scheme='http://www.blogger.com/atom/ns#' term='Game Development'/><category scheme='http://www.blogger.com/atom/ns#' term='Modeling'/><title type='text'>Is COLLADA a Win?</title><content type='html'>I'm always astounded to discover that anyone, like, reads this blog. 
But on the off chance that anyone with serious tool-chain/content-pipeline experience is reading this...&lt;br /&gt;&lt;br /&gt;Is COLLADA a win?&lt;br /&gt;&lt;br /&gt;Like most games, X-Plane has the problem of needing to get content from commercial 3-d modeling programs into our proprietary engine format, with annotated data attached that is specific to X-Plane.  We need to give our artists a way to attach such data (e.g. billboard properties, hard surface attributes, etc.) natively in their 3-d program and have that data make it into X-Plane.&lt;br /&gt;&lt;br /&gt;In fact, the problem is a bit worse for X-Plane because we are effectively an open platform for a whole range of third party developers; thus our artists are not all on any one 3-d program.&lt;br /&gt;&lt;br /&gt;There are three ways I can see to solve this problem:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Write a lot of export scripts.  This is the path we're on now.  We have full featured scripts for AC3D and Blender, and there are a lot of other scripts out there for other modelers.&lt;/p&gt;&lt;p&gt;The obvious problem with this approach is scalability.  Every new modeling feature in X-Plane has to be separately built into every single exporter; the result is invariably inconsistent support for the "full" file format, due to high development costs.  (I maintain one of the exporters, the AC3D one, myself, and I am not up-to-date on my own modeling formats.)&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Create a common, simple, proprietary interchange format.  One of the problems with writing X-Plane models is that you have to optimize them to maximize sim performance.  The idea of creating a simpler format that feeds into a processing tool would be to lower the cost of writing the actual modeling-program-specific export scripts.  Exporters would simply dump out a stream of "stuff" and the post processing tool would clean it up.&lt;/p&gt;&lt;p&gt;We already do this with DSF, our scenery file format.  
DSF is a horribly complex bit-packed format, but a tool (DSF2Text) will convert a simple text stream to the final binary using LR's libraries to do the compression and encoding.  While the DSF code itself is open source, a text file represents an easier API for a wide variety of languages, including scripting languages.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Use an off-the-shelf interchange format, hence the question about COLLADA.  In theory, the win would be that there would be existing export scripts for the interchange format, greatly reducing the time to implement support for a particular modeler.  A common COLLADA -&gt; OBJ converter would then do the final encode once for all programs.&lt;/p&gt;&lt;p&gt;In practice, the devil would be in the details: COLLADA is a very general, rich format; do all modelers support exporting to all COLLADA idioms?  Would there be appropriate 1:1 mappings from the 3-d program to X-Plane?&lt;/p&gt;&lt;p&gt;My concern is that it's bad enough trying to find ways to represent X-Plane concepts in a 3-d program; in order to use existing COLLADA code those concepts would have to be in the 3-d program in a way that is compatible with existing COLLADA export code.&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;Anyway, if you have experience (good or bad) with using COLLADA as an intermediate tool-chain step, I'd love to hear about it; it strikes me as an option for gaining leverage over tool-chain costs whose real value would be entirely determined by the details of implementation.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-3676024935025896350?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/3676024935025896350/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://hacksoflife.blogspot.com/2011/01/is-collada-win.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3676024935025896350'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3676024935025896350'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/01/is-collada-win.html' title='Is COLLADA a Win?'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-7541112559020820126</id><published>2011-01-20T14:59:00.003-05:00</published><updated>2011-01-20T15:09:07.440-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='GLSL'/><title type='text'>Derivatives III: I Ran Out of Rez</title><content type='html'>One more note on &lt;a href="http://hacksoflife.blogspot.com/2011/01/derivatives-ii-conditional-texture.html"&gt;derivatives&lt;/a&gt; in GLSL shaders: derivatives can run into &lt;a href="http://hacksoflife.blogspot.com/2010/02/running-out-of-derivative-res.html"&gt;precision problems&lt;/a&gt; that the underlying expressions don't have.&lt;br /&gt;&lt;br /&gt;Recall that derivatives are typically calculated by taking the actual difference of two values in two nearby pixels.  This means that the derivative is subject to the precision limits between two pixels.&lt;br /&gt;&lt;br /&gt;Consider a UV map spread over a really huge distance, say, 5 km in-game.  
The texture is 1024 x 1024 and therefore each texel is about 5m in size.&lt;br /&gt;&lt;br /&gt;What happens if we zoom way the heck in so that one texel (about 5m) is covering nearly all of the screen?&lt;br /&gt;&lt;br /&gt;We need about 10 bits of precision to select our texel; any remaining precision can be used to take an interpolated position between texels for filtering.  If our monitor res is about 1024x768, we're using 20 bits of precision in our UV map (10 to select the texel, 10 more to give each screen pixel its own position), and with a 32-bit float's 23-bit mantissa we have only three bits left.  If we zoom in any more, we may reach a point where our interpolated UV map doesn't have enough precision to provide a UV position for each pixel. &lt;br /&gt;&lt;br /&gt;(In other words, if we have less than 10 bits of precision left between the left and right side of the screen, then some adjacent pixels will have the &lt;span style="font-style: italic;"&gt;same&lt;/span&gt; UV coordinates!)&lt;br /&gt;&lt;br /&gt;Now this generally doesn't matter for texture sampling.  We're sampling 1024 unique mixes between two texels, and we can only show 256 shades on the screen - in practice if two pixels have the same UV coordinates, it doesn't matter, because the amount of RGB change per pixel is not perceivable anyway.&lt;br /&gt;&lt;br /&gt;But for our derivatives, it's a different story: some pairs of pixels will have a zero derivative, and some will have a non-zero derivative!  Even if we don't run out of res, we're very low on res and our derivatives may be 'chunky' or in other ways screwed up.  If we need to &lt;a href="http://hacksoflife.blogspot.com/2009/11/per-pixel-tangent-space-normal-mapping.html"&gt;reconstruct basis vectors&lt;/a&gt; from our derivatives, those basis vectors are going to be a train wreck.&lt;br /&gt;&lt;br /&gt;The only solution I have found to this problem has been to replace GLSL built-in derivatives with an algorithmic derivative function.  
Fortunately the only cases where we have ridiculous UV mapping are ones where the texture coordinates are generated by formula, and thus a similar formula can be used to create the derivatives.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-7541112559020820126?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/7541112559020820126/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/01/derivatives-iii-i-ran-out-of-rez.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7541112559020820126'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7541112559020820126'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/01/derivatives-iii-i-ran-out-of-rez.html' title='Derivatives III: I Ran Out of Rez'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-6024946098614382599</id><published>2011-01-20T14:42:00.003-05:00</published><updated>2011-01-20T14:59:27.290-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='GLSL'/><title type='text'>Derivatives II: Conditional Texture Fetches</title><content type='html'>In my &lt;a href="http://hacksoflife.blogspot.com/2011/01/derivatives-i-discontinuities-and.html"&gt;previous post&lt;/a&gt; I described how OpenGL often calculates derivatives by differencing 
nearby pixels in a block.  This can cause problems if our UV map has discontinuities.&lt;br /&gt;&lt;br /&gt;Even weirder things happen if we use texture fetches inside an if statement. For example, this will produce some very weird results:&lt;br /&gt;&lt;blockquote&gt;if(uv.x &gt;= 0.0)&lt;br /&gt; gl_FragColor = texture2D(my_sampler,uv);&lt;br /&gt;else&lt;br /&gt; gl_FragColor = vec4(0.0);&lt;/blockquote&gt;You might think that if you use a texture that is a ramp from black on the left to white on the right, you'd get the ramp and then the black end of the texture would seamlessly transition into the hard-coded black from the else statement.&lt;br /&gt;&lt;br /&gt;If your GPU and GLSL compiler are in a forgiving mood, this may work; if they are not, you may get a set of mid-gray artifact pixels at the transition point.  The problem is this bit of fine print (from the &lt;a href="http://www.opengl.org/registry/doc/GLSLangSpec.Full.1.20.8.pdf"&gt;GLSL 1.20.8 spec, section 8.8&lt;/a&gt;):&lt;br /&gt;&lt;blockquote&gt;The method may assume that the function evaluated is continuous.  Therefore derivatives within the body of a non-uniform conditional are undefined.&lt;br /&gt;&lt;/blockquote&gt;You can't take a derivative inside an if statement.  (But since the results are undefined, the GPU can make your life more difficult by sometimes giving you useful results anyway.)  Recall from my past post that a texture2D fetch is like a texture2DGrad with derivatives of the texture coordinate expression.  Since the derivative functions are invalid inside if statements, the derivatives passed to texture2D may be junk.  
In other words, this is bad:&lt;br /&gt;&lt;blockquote&gt;if(stuff)&lt;br /&gt;gl_FragColor = texture2DGrad(tex,uv,dFdx(uv),dFdy(uv));&lt;br /&gt;&lt;/blockquote&gt;but this is okay:&lt;br /&gt;&lt;blockquote&gt;vec2 dx = dFdx(uv);&lt;br /&gt;vec2 dy = dFdy(uv);&lt;br /&gt;if(stuff)&lt;br /&gt;gl_FragColor = texture2DGrad(tex,uv,dx,dy);&lt;/blockquote&gt;That is, you have to use texture2DGrad to move the derivative calculation out of the if statement.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Why Can't the GPU Get This Right (Except When It Does)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Artifacts due to incorrect derivative calculations inside incoherent texture fetches (that is, some pixels texture fetch, nearby ones don't, the derivative is hosed, and our texture fetch is therefore hosed) are definitely sensitive to the hardware, GLSL compiler, and driver, and I ended up switching out my Radeon and GeForce about 30 times before I wrapped my head around this issue.&lt;br /&gt;&lt;br /&gt;This doesn't surprise me.  The spec allows undefined behavior.  Recall that the derivative is based on differencing the value of an expression across a 2x2 pixel group.  To understand why conditionals and derivatives don't mix, we have to understand how modern GPUs handle conditional rasterization.&lt;br /&gt;&lt;br /&gt;(What follows is based on &lt;a href="http://hacksoflife.blogspot.com/2010/12/fun-with-glsl-compilers.html"&gt;my reading some docs on R700 assembly&lt;/a&gt;; it is best to think of it as a model for how GPUs can work, more or less; I am sure there are lots of subtleties to the R700 that I don't understand.)&lt;br /&gt;&lt;br /&gt;The GPU rasterizes pixels in 2x2 blocks, with the same shader executed on four execution units in lock-step.  
That is, each pixel has its own intermediate registers and state, but all four pixels run the same instructions.&lt;br /&gt;&lt;br /&gt;When the shader hits an if statement, the hardware sets a mask for each pixel indicating which pixels are "in" the if statement and which are not.  The entire if statement is run on all hardware, but the results for the pixels that are not in the if statement are thrown out due to the mask.&lt;br /&gt;&lt;br /&gt;If all four pixels hit the if statement the same way, only then can the GPU jump over the if statement, saving actual work.&lt;br /&gt;&lt;br /&gt;So what happens if the if statement is being evaluated for some pixels and not others and we take a derivative?  The answer is: lord knows!  The expression we are calculating may be only partly updated, incorrect, or totally unavailable for some of the pixels.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Branch Coherence&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;As a side note, the property of the GPU to run the entire shader on all pixels when only some of them are using the if statement is why the GPU manufacturers will tell you that a conditional is only a performance win if it is &lt;i&gt;coherent&lt;/i&gt; - that is, if nearby pixels all branch in the same way.  
This is because when nearby pixels branch in different ways, the GPU must run all code and throw out some of the results.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-6024946098614382599?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/6024946098614382599/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/01/derivatives-ii-conditional-texture.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/6024946098614382599'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/6024946098614382599'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/01/derivatives-ii-conditional-texture.html' title='Derivatives II: Conditional Texture Fetches'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-3922725548080678576</id><published>2011-01-20T14:38:00.002-05:00</published><updated>2011-01-20T14:42:13.960-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='GLSL'/><title type='text'>Derivatives I: Discontinuities and Gradients</title><content type='html'>The short of it is this: if you see 2x2 pixel artifacts in your shader, you might need texture2DGrad.  
Now the long version.&lt;br /&gt;&lt;br /&gt;How does OpenGL know what mipmap level to use when you sample a texture  in your GLSL shader with texture2D?  The answer is that this:&lt;br /&gt;&lt;blockquote&gt;texture2D(my_texture,uv);&lt;br /&gt;&lt;/blockquote&gt;actually does something like this:&lt;br /&gt;&lt;blockquote&gt;texture2DGrad(my_texture,uv,dFdx(uv),dFdy(uv));&lt;br /&gt;&lt;/blockquote&gt;In other words, texture2D takes the derivative of your  input texture coordinates and uses those derivatives to decide which  mipmap level to access.  The larger the derivatives, the lower the mipmap  level.  (The actual implementation is more complicated.)&lt;br /&gt;&lt;br /&gt;Before continuing, a brief exercise in visualization.  Imagine a cube  with a single square face visible to us (parallel to the screen).  The  cube face is textured with a single 256x256 texture. If we zoom the  camera so that the cube takes 256x256 screen pixels, the derivative of  the UV map between any two pixels on screen is about 1/256 in both  directions, and we want the highest level mipmap.  If we zoom out so  that the cube takes up only 2x2 pixels, the derivative is about 1.0 in  both directions - and we want the lowest mipmap level.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Where Do Derivatives Come From?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The GLSL derivative functions are usually implemented by differencing -  that is, the GPU takes a block of 2x2 pixels and differences the  variable or expression passed to dFdx and dFdy, to calculate an  'approximate' derivative.  Many GPUs rasterize 2x2 clusters of pixels at  a time, with the shader instructions for the four pixels run in  lock-step, so the hardware can be set up to efficiently "cross" the four  pixels to find our derivatives.&lt;br /&gt;&lt;br /&gt;This means that if there is a discontinuity between those pixels, the  derivative may be, well, surprising.  
For example, consider something  like this:&lt;br /&gt;&lt;blockquote&gt;vec2 uv = gl_TexCoord[0].st;&lt;br /&gt;if(uv.x &gt; 0.5) uv.y += 0.25;&lt;br /&gt;gl_FragColor = texture2D(my_sampler, uv);&lt;/blockquote&gt;What happens if  two of the pixels in our 2x2 block have uv.x &gt; 0.5 and the other two  don't?  Well, the answer is that uv.y will be 0.25 bigger for some but  not all pixels, and the derivative of uv.y will be very big!  This in  turn will cause texture2D to fetch a low mipmap level, much lower than  any other 2x2 pixels that are "coherent".  (Coherent here means all 4  pixels have the same boolean answer to the if conditional.)&lt;br /&gt;&lt;br /&gt;One way to think of this is: since the derivatives are found by looking  at actual pixels on screen, a discontinuity is seen by the derivative  function as a really low-res UV map, and thus a low mipmap level is  selected.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Fixing The Derivative&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;So what can we do?  We can provide OpenGL with an expression whose  derivative is about the same as our real texture coordinates, but  without discontinuities.  For example, we can rewrite our above example  like this:&lt;br /&gt;&lt;blockquote&gt;vec2 uv = gl_TexCoord[0].st;&lt;br /&gt;if(uv.x &gt; 0.5) uv.y += 0.25;&lt;br /&gt;gl_FragColor = texture2DGrad(my_sampler, uv,dFdx(gl_TexCoord[0].st),dFdy(gl_TexCoord[0].st));&lt;/blockquote&gt; Our actual texture samples come from a discontinuous UV map, but our derivative comes from the original continuous function.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Breaking Continuity&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I first ran across this while working on the 'tile' shader for X-Plane 10.  The tile shader breaks each texture into a sub-grid of tiles and then randomly swizzles the tiles, like a number puzzle that someone has scrambled.  
The tile shader hides repetitions in the texture, and (because it runs in shader) it doesn't require additionally tessellating geometry, saving vertex count.&lt;br /&gt;&lt;br /&gt;(Using fragment ops to save vertex count might seem strange, but in this case, our base mesh is already heavily cut up based on other criteria; having the texture swizzle run orthogonally lets us subdivide the mesh based on other, unrelated criteria.)&lt;br /&gt;&lt;br /&gt;Without texture2DGrad, we would get a set of dark 2x2 pixel blocks at the edge of the tiles.  The tiles are induced via some math that includes a floor() function to separate our tile number from our location within the tile.  The floor function can induce discontinuities even without conditional logic, because floor is not a continuous function.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-3922725548080678576?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/3922725548080678576/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/01/derivatives-i-discontinuities-and.html#comment-form' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3922725548080678576'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3922725548080678576'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/01/derivatives-i-discontinuities-and.html' title='Derivatives I: Discontinuities and Gradients'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-5394699245260165075</id><published>2011-01-08T13:22:00.004-05:00</published><updated>2011-01-08T13:25:52.974-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='CVS Voodoo'/><title type='text'>Stupid CVS Tricks</title><content type='html'>I finally figured out (thanks to &lt;a href="http://durak.org/sean/pubs/software/cvsbook/Enabling-Watches-In-The-Repository.html"&gt;this&lt;/a&gt;) how to get CVS to notify us somewhere other than the mail service of the server it's running on.  See the link for instructions, but basically you can make a 'users' dictionary file that maps users to custom external email addresses...the file isn't in the default config (which is weird).&lt;br /&gt;&lt;br /&gt;Also, note that the CVS watch command can do two things:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;It can subscribe to events (edit, unedit and commit).  When you watch add yourself with some or all of these events, you get email (looked up via users).  Since watches go to specific subscribed users, the CVS notify file uses a wildcard to send to specific users.  (This is different from loginfo which sends to a list of everyone who cares about any commit for a given module.)&lt;/li&gt;&lt;li&gt;It can force a file to be checked out locked (thus forcing an edit/unedit workflow) using cvs watch on.  
We only use this to force our X-Code project file to be checked out locked (to prevent the accumulation of lots of trivial project changes).&lt;/li&gt;&lt;/ol&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-5394699245260165075?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/5394699245260165075/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/01/stupid-cvs-tricks.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5394699245260165075'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5394699245260165075'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/01/stupid-cvs-tricks.html' title='Stupid CVS Tricks'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-2009437875538280561</id><published>2011-01-04T15:02:00.003-05:00</published><updated>2011-01-04T15:16:21.997-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Threading'/><title type='text'>CAS and Reference Counting Revisited</title><content type='html'>A while ago I &lt;a href="http://hacksoflife.blogspot.com/2009/07/cas-and-reference-counting-dont-mix.html"&gt;suggested&lt;/a&gt; that we can't use atomic compare and swap (CAS) and reference counting together to update an arbitrary data structure because we can't 
atomically dereference the pointer and increase our reference count at the same time.  The problem is that there is an instant while we are 'off the end' of the pointer that we haven't increased our reference count; an updater would have no idea that throwing out the copy of the data we are using is a poor idea.&lt;br /&gt;&lt;br /&gt;Here's a possible work-around: use the low bit of the pointer we want to CAS as a "lock" bit and spin.  The algorithm would go something like this:&lt;br /&gt;&lt;blockquote&gt;read_begin:&lt;br /&gt;while(1){ // this spins if someone else is trying to&lt;br /&gt;ret = ptr // read-begin.&lt;br /&gt; if(CAS(ptr, ptr | 1)) break&lt;br /&gt;}&lt;br /&gt;atomic_inc(&amp;amp;ret-&gt;ref_count);&lt;br /&gt;CAS(ptr, ptr &amp;amp; ~1);&lt;br /&gt;return ret&lt;br /&gt;&lt;br /&gt;read_end(data)&lt;br /&gt;if(atomic_dec(&amp;amp;data-&gt;ref_count)==0)&lt;br /&gt; delete data&lt;br /&gt;&lt;br /&gt;update:&lt;br /&gt;while(1)&lt;br /&gt; old = read_begin // take a ref count because we are copying from data&lt;br /&gt; create new copy from old&lt;br /&gt; if CAS(old,new){&lt;br /&gt;    assert(atomic_dec(&amp;amp;old-&gt;ref_count) &gt; 0);&lt;br /&gt;    read_end(old) // two decs,&lt;br /&gt;    break&lt;br /&gt; } else {&lt;br /&gt;   read_end(old)  // we failed to swap, retry.*&lt;br /&gt;   delete new }&lt;br /&gt;&lt;/blockquote&gt;The idea here is that we can enforce spinning for a short time on other readers while we read the pointer to our block of data by always CASing in the low bit.  (This assumes that memory is at least 2-byte aligned, which is an acceptable design assumption on pretty much all modern machines.)  
Thus we are holding a spin lock while we get our reference count registered.&lt;br /&gt;&lt;br /&gt;This code assumes that the data we are protecting has a baseline reference count of one - thus once an updater has replaced it, the updater removes this 'baseline' count, and the last reader to stop counting it (and an updater is a reader) releases it.&lt;br /&gt;&lt;br /&gt;One reason why I only come back to RCU-style algorithms every few months is that (at least for this one) there's no way to block to ensure that the old copy of the data has been fully released.  Knowing that your update is "fully committed" to all thread contexts is an important property that this algorithm does not have.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-2009437875538280561?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/2009437875538280561/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2011/01/cas-and-reference-counting-revisited.html#comment-form' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/2009437875538280561'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/2009437875538280561'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2011/01/cas-and-reference-counting-revisited.html' title='CAS and Reference Counting Revisited'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-5783835254458797019</id><published>2010-12-22T15:47:00.004-05:00</published><updated>2010-12-22T16:13:18.741-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='GLSL'/><title type='text'>Fun With GLSL Compilers</title><content type='html'>I've been poking at GPU &lt;a href="http://developer.amd.com/gpu/shader/pages/default.aspx"&gt;ShaderAnalyzer&lt;/a&gt;; this Windows performance tool from ATI gives you a simple environment: you enter GLSL, hit compile, and look at the assembly that pops out.  Perfect!  (It also shows you execution cycles, but for my purposes, namely understanding what the compiler does for me, the assembly is key.)&lt;br /&gt;&lt;br /&gt;Here are some things I have learned:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;p&gt;RV790 assembly is quite complex.  (Thank goodness I don't have to code it myself.)  ALU instructions consist of 5 scalar sub-instructions, only one of which can have transcendental opcodes.  There's a bit of fine print; &lt;a href="http://developer.amd.com/gpu/ATIStreamSDK/assets/R600-R700-Evergreen_Assembly_Language_Format.pdf"&gt;this&lt;/a&gt; and &lt;a href="http://developer.amd.com/gpu_assets/R700-family_instruction_set_architecture.pdf"&gt;this&lt;/a&gt; make useful reading.  One thing to note: the ALU has a number of 'small' tricks (absolute value, negation, clamping) 'for free'.  Sometimes the compiler will use these tricks, sometimes not.&lt;/p&gt;&lt;p&gt;Generally, if you write vectorized code (e.g. uniform work on vec4) the scheduling will work out nicely.  But the units of execution really are scalar, so it doesn't make sense to write work that isn't needed.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The compiler inlines pretty much everything, which is just fine by me.  
(I have no idea if recursion is legal in GLSL, I'd never use it in production, but when I wrote a recursive factorial function, the compiler simply inlined 127 iterations and called it a day.  Awesome!)&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The compiler understands a reasonable amount of constant folding, including (most importantly) multiply by zero.  For example: I write an expensive albedo function: pow(gl_Color,gl_Color.aaaa) and multiply it by a light function that returns vec4(0.0).  The result: the compiler nukes the entire code sequence and simply loads 0.&lt;/p&gt;&lt;p&gt;(BTW, pow is expensive - since only one of the five ALU slots can run log and exp, raising each color channel to a non-constant power takes eight instruction groups!  Ouch.)&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The compiler will remove conditional code when the condition is fully known at compile time.  So for example, an if statement where the comparison comes from functions that return constants will be nuked, and one of its two clauses is deleted.&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;/li&gt;&lt;li&gt;The compiler does not seem to do inference, at least in the one case I looked at.  By inference I mean: if (max(0.6,gl_FragColor.r) &gt; 0.3) will (ignoring NaN logic) always be true, regardless of gl_FragColor.  But for the compiler to know this, it has to make an inference - that is, it has to compare the range [0.6..inf) with 0.3.  My understanding is that LLVM can do this kind of thing, but when I tried it in shader I simply got the full, expensive, conditional code.  Moral of the story: use and apply your human brain. :-)&lt;/li&gt;&lt;/ul&gt;Now...why do I care?&lt;br /&gt;&lt;br /&gt;X-Plane's physical shader is based on conditional compilation - that is, for any given state vector of "tricks" we want to use, we recompile the shader with some #defines at the front which turn features on and off.  The result is a large number of shaders, none of which need conditional logic in-shader. 
 Fill rate isn't consumed by features we don't use.  (This technique comes from our original use of GLSL to emulate and then improve on the fixed function pipeline.  To match fixed-function performance, we had to 'compile out' anything we didn't use, particularly for first-gen DX9 hardware which doesn't give you conditional logic for free.)&lt;br /&gt;&lt;br /&gt;The problem with this technique (and you can see this in the X-Plane 9 shaders) is that it doesn't scale well with code size.  For version 10 we've done a lot of shader work, and hand-optimizing the conditional logic is getting more and more difficult.&lt;br /&gt;&lt;br /&gt;My conclusion from observing the compiler is that 99% of the time, I can relax a little bit and let the compiler take care of optimizing the shaders down.  In particular, if I define functions for each stage of the shader and use conditional compilation to 'simplify' the rule, then the simple cases will boil down to very few instructions.  For example:&lt;br /&gt;&lt;blockquote&gt;float calc_spec()&lt;br /&gt;{&lt;br /&gt;#if has_spec&lt;br /&gt;     return pow(max(0.0,dot(eye_nrm,sun_vec)),128.0);&lt;br /&gt;#else&lt;br /&gt;     return 0.0;&lt;br /&gt;#endif&lt;br /&gt;}&lt;br /&gt;void main()&lt;br /&gt;{&lt;br /&gt;....&lt;br /&gt;float s = calc_spec();&lt;br /&gt;gl_FragColor = albedo * lighting * shadow + ambient + shadow * vec4(s,s,s,0.0);&lt;br /&gt;}&lt;/blockquote&gt;In this mess, our specularity function is subject to conditional removal for non-shiny materials.  
When we do this, not only is the actual specularity calc removed, but the compiler will figure out that 's' will always be 0.0 and nuke the MAD of shadow * specularity into the final lighting sum.&lt;br /&gt;&lt;br /&gt;That's a trivial example, but it shows the principle of structuring the components separately and letting the compiler put the mess together.&lt;br /&gt;&lt;br /&gt;As a final note: the compiler's optimization is not perfect; I suspect the above technique will 'leak' a few instructions in the simple cases relative to a one-off carefully hand-coded GLSL shader, and the GLSL isn't going to be quite as tight in a few cases as actually writing assembly. &lt;br /&gt;&lt;br /&gt;But I can live with that, most of the time.  We can always go and hand tune the performance cases that absolutely matter most, and the time saved working on the huge mess that is the conditional shader gives me the time to do that hand optimization where it is most important.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-5783835254458797019?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/5783835254458797019/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/12/fun-with-glsl-compilers.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5783835254458797019'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5783835254458797019'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/12/fun-with-glsl-compilers.html' title='Fun With GLSL Compilers'/><author><name>Benjamin 
Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-446341442678762833</id><published>2010-12-20T23:01:00.001-05:00</published><updated>2010-12-20T23:03:37.572-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Rants'/><title type='text'>Lisp Isn't a Language...</title><content type='html'>Someone sent me &lt;a href="http://www.junauza.com/2010/12/top-50-programming-quotes-of-all-time.html"&gt;this fun list&lt;/a&gt; of programming quotes.  From Alan Kay:&lt;br /&gt;&lt;blockquote&gt;Lisp isn't a language, it's a building material.&lt;br /&gt;&lt;/blockquote&gt;And that material is &lt;a href="http://en.wikipedia.org/wiki/Cow_dung#Uses"&gt;dung&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-446341442678762833?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/446341442678762833/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/12/lisp-isnt-language.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/446341442678762833'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/446341442678762833'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/12/lisp-isnt-language.html' title='Lisp Isn&apos;t a Language...'/><author><name>Benjamin 
Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-658886256339210659</id><published>2010-12-09T14:09:00.002-05:00</published><updated>2010-12-09T14:25:45.824-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='c++'/><title type='text'>What OOP Isn't</title><content type='html'>When I took my first computer science class (I had already been programming on my own for a while) the department was going through something of a civil war.  Some of the department had gotten religion in the form of object oriented programming (OOP) and were trying to thrust it on everything.  They made us implement a linked list in an OOP way: every node was fully encapsulated!&lt;br /&gt; &lt;br /&gt;(If you're wondering how this works, most list editing operations had to be stack-recursive - a node would set its 'next' to the return value of a call on its next, allowing a node to 'cut itself out'.  It made it very hard for students who had never used linked lists to understand what was going on, because they had to learn recursion and linked lists at the same time.  The result was something with the performance of LISP and elegance of C++.  It was horrible.)&lt;br /&gt;&lt;br /&gt;They told us that OOP involved encapsulation, polymorphism, and inheritance; I have commented in the past on why this last idea is &lt;a href="http://hacksoflife.blogspot.com/2007/01/inheritance-of-implementation-is-evil.html"&gt;often just a poor idea&lt;/a&gt;.  
At the time, in school I only had enough programming experience to say that what we were being taught (all OOP, all the time, period) was a lot more difficult than what I had been doing (use an object for something big, like a game character) and was producing code that was more verbose and not particularly fast.  Now that I have some software engineering experience, I think I can articulate the problem more precisely.&lt;br /&gt;&lt;br /&gt;When talking to a new programmer who is looking at OOP and trying to figure out what it's all about, I say that the relative importance of encapsulation, polymorphism, and inheritance is approximately 90%, 10%, 0% respectively.  The vast majority of the value of OOP is that it provides an idiom to keep one piece of code from becoming hopelessly intertwined with other pieces of code, and that's valuable in large software projects.  It's also impossible to teach to undergraduates because they never have a chance to write enough code for it to matter.&lt;br /&gt;&lt;br /&gt;Polymorphism is nice, but in my experience it's not as useful as encapsulation.  If you have a polymorphic interface, you have an interface, which means that it's encapsulated...but there are plenty of cases where an interface is one-off and has no polymorphic properties.  Maybe 90%-10% is harsh, but I think it's the encapsulation that matters.  It may be that some product spaces are more polymorphic than others.  WorldEditor (LR's open source scenery editor) has polymorphic hierarchies for most of its core components, while X-Plane itself has very few.&lt;br /&gt;&lt;br /&gt;I bring this up because I'd like to advance (in a future blog post) a comparison of OOP techniques to others (for real software engineering problems), but OOP &lt;a href="http://en.wikipedia.org/wiki/Object-oriented_programming#Criticisms"&gt;comes with a bit of baggage&lt;/a&gt;.  
The notions that OOP would make us better programmers, help us write bug free code faster, or help bad programmers become good programmers have all proven to be naively optimistic.  (In particular, bad programmers have proven to be surprisingly resourceful at writing bad code given virtually any programming idiom.)&lt;br /&gt;&lt;br /&gt;So I'd like to define (OOP - hype) as something like: good language support for idioms that make encapsulation and sometimes polymorphic interfaces faster to code.  And that's useful to me!  I could code the same thing in pure C, but it would make my repetitive stress injuries worse from more typing, so why do that?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-658886256339210659?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/658886256339210659/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/12/what-oop-isnt.html#comment-form' title='12 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/658886256339210659'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/658886256339210659'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/12/what-oop-isnt.html' title='What OOP Isn&apos;t'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>12</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-9139618168529985065</id><published>2010-12-09T11:51:00.002-05:00</published><updated>2010-12-09T11:53:52.936-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='C'/><category scheme='http://www.blogger.com/atom/ns#' term='GLSL'/><category scheme='http://www.blogger.com/atom/ns#' term='Rants'/><title type='text'>FMTT, GLSL Edition</title><content type='html'>This is in the same vein as &lt;a href="http://hacksoflife.blogspot.com/2010/11/i-hate-c-part-492.html"&gt;I Hate C&lt;/a&gt; -- all of its derivatives are contaminated with its brain damage.&lt;br /&gt;&lt;blockquote&gt;&lt;code&gt;gl_FragData[0] = vec4(tex_color.rgb * gl_Color.rgb*tex_color.a,clamp(tex_color.a + lit_color.a,0.0,1.0));       &lt;br /&gt;gl_FragData[1] = vec4(shiny_ao * cut_pos, cut_pos*position_eye.z/-1024.0, 0.0, cut_pos);&lt;br /&gt;gl_FragData[2] = vec4(normal_eye_use.xyz*cut_pos, cut_pos);&lt;br /&gt;gl_FragData[3] = vec4(lit_color.rgb + tex_color.rgb * gl_FrontLightModelProduct.sceneColor.rgb, (tex_color.a + lit_color.a,0.0,1.0));&lt;/code&gt;&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-9139618168529985065?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/9139618168529985065/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/12/fmtt-glsl-edition.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/9139618168529985065'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/6042417775578107106/posts/default/9139618168529985065'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/12/fmtt-glsl-edition.html' title='FMTT, GLSL Edition'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-686888426331326284</id><published>2010-12-05T12:12:00.004-05:00</published><updated>2010-12-05T12:30:08.085-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='GLSL'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>Yet Another This-Is-Our-GBuffer-Format Post</title><content type='html'>If you read just about any presentation by a game studio (e.g. for GDC or SIGGRAPH) on deferred rendering or deferred lighting, they'll probably discuss their G-Buffer format and how they tried to pack as much information into a tiny space as possible.  X-Plane 10 will feature a deferred renderer (to back the global spill feature set).  And...here is how we pack our G-Buffer.&lt;br /&gt;&lt;br /&gt;The good news with X-Plane is that we don't have a complex legacy material system to support.  If a game is well-batched, the forward renderer can associate separate shaders with batches by material, and each 'material' can thus be radically different in how it handles/transfers light.  With a G-Buffer, the lighting equation must be unified and thus everything we need to know to support some common lighting format must go into the G-Buffer.  (One thing you can do is pack a material index into the G-Buffer.)  
Fortunately, since X-Plane doesn't have such a multi-shader beast, we only had one property to save: a shininess ratio (0-1, about 8 bits of precision needed).&lt;br /&gt;&lt;br /&gt;What we did have to support was both a full albedo and a full RGB emissive texture; additive emissive textures have been in the sim for over a decade, and authors use them heavily.  (They can be modulated by datarefs within the sim, which makes them useful for animating and modulating baked lighting effects.)  X-Plane 10 also has a static shadow/baked ambient occlusion type term on some of the new scenery assets that needs to be preserved, also with 8 bits of precision.&lt;br /&gt;&lt;br /&gt;The other challenge is that X-Plane authors use alpha translucency fairly heavily; deferred renderers don't do this very well.  One solution we have for airplanes is to pull out some translucent surfaces and render them post-deferred-renderer as a forward over-draw pass.  But for surfaces that can't be pulled out, we need to do the least-bad thing.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The Format&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Thus we pack our G-Buffer into 16 bytes:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;RGBA8 albedo (with alpha)&lt;/li&gt;&lt;li&gt;RG16F normal (Z vector is reconstructed)&lt;/li&gt;&lt;li&gt;RG16F depth + shadow-merged-with-shininess&lt;/li&gt;&lt;li&gt;RGBA8 emissive (with alpha)&lt;/li&gt;&lt;/ol&gt;When drawing into the G-Buffer, we use full blending for the albedo and emissive layers, but we always set the alpha for the depth/normal layers to 0.0 or 1.0, thus using the alpha blend as a per-render-target "kill" switch.  This is based on the level of translucency.  Thus if something is highly transparent, we keep the physical position of what is behind it (light passes through) but if it is opaque enough, we over-write the position/normal (light bounces off of it).  
It's not perfect, but it's  the least bad thing I could come up with.&lt;br /&gt;&lt;br /&gt;(As a side note, if we had a different layout, we could blend the shininess ratio, for example, when we keep a physical position fragment, to try to limit shininess on translucent elements.)&lt;br /&gt;&lt;br /&gt;Note that on OS X 10.5 red-green textures are not available, so we have to fall back to four RGBA_16F textures, doubling VRAM.  This costs us at least 20% fill-rate on an 8800, but the quick sanity test I did wasn't very abusive; heavier cases are probably a bit worse.&lt;br /&gt;&lt;br /&gt;So far we seem to be surviving with a 16-bit floating point eye-space depth coordinate.  It does not hold up very well in a planet-wide render though, for rendering techniques where reconstructing the position is important (e.g. &lt;a href="http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter16.html"&gt;O'Neil-style atmospheric scattering&lt;/a&gt;).  A simple workaround would be to calculate the intersection of the fragment ray with the planet directly by doing a transform of the planet sphere from model-view space.  (E.g. if we know that our fragment came from a sphere, why not just work with the original mathematical sphere.)&lt;br /&gt;&lt;br /&gt;16F depth does give us reasonably accurate shadows, at least up close, and far away the shadows are going to be limited in quality anyway.  I tried logarithmic depth, but it made shadows worse.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Packing Shadowing and Shine&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For the static shadow/AO ("shadow") term and the level of specularity ("shine") we have two parameters that need about 8 bits of precision, and we have a 16-bit channel. Perfect, right?  
Well, not so much.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;NVidia cards won't render to 16-bit integer channels on some platforms.&lt;/li&gt;&lt;li&gt;ATI cards don't export a bit-cast from integer to floating point, making it hard to pack a real 16-bit int (or two 8-bit ints) into a float.&lt;/li&gt;&lt;li&gt;If we change the channel to RGBA8 (and combine RG into a virtual 16F) we can't actually use the fourth byte (alpha) because the GL insists on dumping our alpha values.  Extended blend would fix this but it's not supported on OS X and even on Windows you can't use it with multiple render targets.&lt;/li&gt;&lt;/ul&gt;So we can't actually get the bits we pay for and that sucks.  But we can cheat.  The trick is: the deferred renderer will cut out specular highlights that are in shadow.  Thus as the shadow term becomes stronger, the shine term becomes unimportant.&lt;br /&gt;&lt;br /&gt;So we simply encode 256.0 * shadow + shine.  The resulting quantity gives shininess over 8 bits of precision without shadow, and reduces shininess to around 2 bits in full shadow.  If you view the decoded channels separately you can see banding artifacts come in on the shine channel as the shadows kick in.  But when you view the two together in the final shader, the artifacts aren't visible at all because the shadow masks out the banded shininess.&lt;br /&gt;&lt;br /&gt;(What this trick has done is effectively recycle the exponent as a 'mixer', allocating bits between the two channels.)&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Future Formats&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A future extension would be to add a fifth 4-byte target. This would give us room to extend to full 32-bit floating point depth (should that prove to be useful), with shadow and shine in 8 bit channels with one new 8-bit channel left.  
Or alternatively we could:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Keep Z at 16F and keep an extra channel.&lt;/li&gt;&lt;li&gt;Technically if we can accept this 'lossy' packing we can get  four components into an RG16F, while we can only get 3 pure 8-bit components.  (This is due to how the GL manages alpha.)&lt;/li&gt;&lt;li&gt;If we target hardware that provides float-to-int bit casts, we could have four 8-bit components in the new channel.&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-686888426331326284?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/686888426331326284/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/12/yet-another-this-is-our-gbuffer-format.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/686888426331326284'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/686888426331326284'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/12/yet-another-this-is-our-gbuffer-format.html' title='Yet Another This-Is-Our-GBuffer-Format Post'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-5105166261344057021</id><published>2010-12-04T15:16:00.003-05:00</published><updated>2010-12-04T15:20:41.917-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' 
term='Threading'/><category scheme='http://www.blogger.com/atom/ns#' term='Linux'/><title type='text'>Semaphore Follow-Up: NPTL</title><content type='html'>A quick follow-up from my &lt;a href="http://hacksoflife.blogspot.com/2010/12/performance-of-semaphore-vs-condition.html"&gt;previous post&lt;/a&gt; on condition variables, etc.  With &lt;a href="http://en.wikipedia.org/wiki/Native_POSIX_Thread_Library"&gt;NPTL&lt;/a&gt; (the pthreads implementation on Linux) a lot of the original issues I was trying to cope with don't exist.  Some things NPTL does:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;pthread mutexes are spin-sleep locks, so they can be used as short-term critical sections without too much trouble.  Given a moderately contested but shortly held lock, this is a win.&lt;/li&gt;&lt;li&gt;sem_t semaphores have an atomic counter to avoid system calls in the uncontested case.  When inited privately (sem_init) they appear to be lean and mean.&lt;/li&gt;&lt;li&gt;All synchronization is done around &lt;a href="http://en.wikipedia.org/wiki/Futex"&gt;futexes&lt;/a&gt;, ensuring that uncontested cases can be managed with atomic operations.  (The OS X pthreads library at least uses spin locks around user space book-keeping for the uncontested case, but I think the futex code path is faster.)&lt;/li&gt;&lt;/ul&gt;There is one case where using a condition variable really would be superior to a semaphore on Linux: if you really want a condition variable (and aren't just using it to build a semaphore).  In particular, the futex system call that helps NPTL sleep threads as needed has special operations to move a thread from one queue to another while asleep.  This fixes the thundering herd problem when every thread on a condition is woken up at once.  
This isn't something I use, but if you need it, NPTL makes it fast.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-5105166261344057021?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/5105166261344057021/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/12/semaphore-follow-up-ntpl.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5105166261344057021'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5105166261344057021'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/12/semaphore-follow-up-ntpl.html' title='Semaphore Follow-Up: NPTL'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-4548255404136501757</id><published>2010-12-03T13:24:00.002-05:00</published><updated>2010-12-03T13:57:00.590-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Threading'/><category scheme='http://www.blogger.com/atom/ns#' term='Performance'/><title type='text'>Performance of Semaphore Vs. Condition Variable</title><content type='html'>I was looking at the performance of X-Plane's inter-thread messaging.  
Inter-thread message costs haven't been a huge concern to us in the past because the jobs sent to worker threads tend to be quite large, amortizing overhead, and they are usually returned asynchronously, so we don't care about latency.&lt;br /&gt;&lt;br /&gt;X-Plane 10 features a threaded flight model (or rather, the flight model of all airplanes is executed in parallel).  This is a case where we do care about latency (since we can't proceed until the flight model completes*) and the job size is not that big, so overhead matters more.&lt;br /&gt;&lt;br /&gt;Our old message queue was based on a pthread condition variable.  It was taking about 200 µsec to launch all workers, and a Shark analysis showed what I can only describe as a "total train wreck" of lock contention.  The workers were waking up and immediately crashing into each other trying to acquire the mutex that protects the condition variable's "guts".&lt;br /&gt;&lt;br /&gt;I replaced the condition variable + mutex implementation of the message queue with the implementation that already ships on Windows: a critical section to protect contents plus a semaphore to maintain count.  I implemented the semaphore as an atomic wrapper around a mach semaphore.  (See the libdispatch semaphore code for an idea of how to do this.)  The results were about 2x better: 80 µsec to launch from sleep, about 45 µsec of which were spent signaling the mach semaphore (5-10 µsec per call) plus a little bit of overhead.&lt;br /&gt;&lt;br /&gt;So why was the new implementation faster?  
I dug into the pthreads source to find out.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;What Happens With PThreads&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The pthread implementation on OS X (at least as of 10.5.x, whose source I browsed; it's in libc, by the way) goes something like this:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A spin lock protects the "guts" of both a pthread mutex and a pthread condition variable.&lt;/li&gt;&lt;li&gt;An OS semaphore provides the actual blocking for a pthread mutex and a pthread condition variable.&lt;/li&gt;&lt;/ul&gt;Mutex lock/unlock is therefore not as good as a real spinlock (for the cases where spinlocks are, um, good) but it does have a fast path where, if the lock is uncontested, you can get in, take the lock and get out without ever having to make a system call.&lt;br /&gt;&lt;br /&gt;The condition variable sequence is a little bit trickier.  A signal to a condition variable with no waiters is a fast early exit, but we have to grab the spin lock twice if we signal someone.  The condition wait is the painful part.  pthread_cond_wait has to reacquire the locking mutex once the waiter is signaled.  This is the source of the contention; if a lot of messages are written to a lot of worker threads at once, the worker threads bottleneck trying to reacquire the mutex that is tied to the condition variable that woke them up.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Better With Semaphores&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;When using the atomic semaphore/critical section design, the advantage we get is significantly finer-grained locking.  The lock on our message queue now only protects the queue items themselves, and is never held while we are inside pthread code.  This gets us through significantly faster.  It is inevitable that if we are going to queue work to many threads via one queue, there's going to be some bottleneck trying to get the messages out.  
This design minimizes the cost of getting a message out and thus minimizes the serialization path.&lt;br /&gt;&lt;br /&gt;(Shark doesn't have quite the resolution to measure it in the code example I have running now, but I suspect that some of the cost still in the system is the spin time waiting for messages to be pulled.)&lt;br /&gt;&lt;br /&gt;Part of the win here is that we are using a spin lock rather than a sleep lock on the message queue; while the pthread mutex will fast-case when uncontested, it will always sleep the thread if even two threads are trying to acquire the mutex.  If the protected piece of code is only a few instructions, a spin is a lot cheaper than the system call to block on a semaphore.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Better Throughput When Pre-Queued&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;One of the things that made the old design go haywire in the version 10 test case was that it was a two-level set of message queues, with the main job list containing a message to go check another queue for the actual work.  (Believe it or not there is a good design reason for this, but that's another post.)  In this two-level design the semaphore + critical section hits its fast case.  The second queue is already full by the time we start waking up workers.  Therefore the second queue's semaphore is positive, which means that the decrement in the message dequeue will hit the fast atomic operations case and be non-blocking.  The workers then simply have to grab the spin lock, pop the message, and be done.&lt;br /&gt;&lt;br /&gt;(Again, the pthread implementation can fast-case when the mutex is uncontested, but it doesn't spin.)&lt;br /&gt;&lt;br /&gt;To forestall the comments: this is not a criticism of the performance of pthread condition variables and mutexes; performance could certainly also have been boosted by using a different set of pthread synchronizers.  
It is mainly an observation that, because the pthread mutex is heavier than a spin lock, condition variables may be expensive when you wanted a light-weight lock but didn't get one.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Future Performance Boosts&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There is one additional performance boost I have been looking at, although I am not sure when I will put it into place.  The message queue (which is a lock around an STL list right now**) could be replaced with a ring buffer FIFO.  With the FIFO design we would maintain:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A pair of semaphores, counting filled and free entries.  Reading/writing blocks until the needed filled/free entry is available.&lt;/li&gt;&lt;li&gt;Atomic read/write pointers that can be advanced once the semaphore indicates that there is content/free space to advance to.&lt;/li&gt;&lt;/ul&gt;This design would have the advantage that it doesn't need to spin, with message read/write through the ring buffer being almost fully independent.  Furthermore, since we have atomics wrapping our semaphores, the non-blocking case (reading when there are messages, writing when there is free space) is non-blocking and non-spinning.&lt;br /&gt;&lt;br /&gt;Two things stop me from dropping this in:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Our current message queue API assumes that messages can always be written and that writing is reasonably fast.  (The current write only needs to acquire a spin lock.)  If the ring buffer can fill, we could block indefinitely until space is available, or we have to expose to client code that writes can time out.  That's a pretty big change.&lt;/li&gt;&lt;li&gt;The current queue has a lock-and-edit operation that is used to optimize the case where queued jobs have to be flushed.  
Since the ring buffer FIFO is lock free, we can't really lock it to inspect/edit it.&lt;/li&gt;&lt;/ul&gt;* We looked at implementing a truly asynchronous flight model, but the interaction between third party add-ons, the flight model, and the rendering engine was such that the synchronization issues were too large for a relatively small win (the reduction of latency of one flight model).  Running flight models against each other gets us most of the win in the cases where the flight model really costs us something.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-4548255404136501757?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/4548255404136501757/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/12/performance-of-semaphore-vs-condition.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/4548255404136501757'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/4548255404136501757'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/12/performance-of-semaphore-vs-condition.html' title='Performance of Semaphore Vs. 
Condition Variable'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-5402262706630231303</id><published>2010-11-30T12:43:00.004-05:00</published><updated>2010-11-30T13:24:18.867-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><category scheme='http://www.blogger.com/atom/ns#' term='Performance'/><title type='text'>Is 1% A Lot?</title><content type='html'>When optimizing code, is a 1% optimization (that is, an optimization that reduces run time by 1%) a lot?  Well, yes and no.  For any code optimization, we have to look at two important factors: &lt;span style="font-style: italic;"&gt;leverage&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;repeatability&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Leverage&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Let's say our game spends 90% of the time drawing the 3-d world and 10% drawing the UI.  Your leverage ratios are therefore 0.9 for 3-d and 0.1 for the UI.  Those are the scaling factors that discount the value of your optimizations.  So a 1% optimization to the rendering engine will give you a 0.9% speed boost overall, while a 1% optimization to the UI will give you only a 0.1% speed boost overall.&lt;br /&gt;&lt;br /&gt;So my first answer is: 1% is not a lot unless you have a lot of leverage.   If your leverage ratio is 0.05 you don't have a lot of leverage, and even a 20% optimization isn't going to produce noticeable results.&lt;br /&gt;&lt;br /&gt;This is why using an adaptive sampling profiler like Shark is so important.  A Shark time profile is your code sorted by leverage.  
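&lt;br /&gt;&lt;br /&gt;To make the leverage arithmetic above concrete, here is a minimal sketch in Python (chosen purely for illustration).  The function name overall_gain and the sample numbers are assumptions made up for this example; they are not taken from X-Plane or from Shark.&lt;br /&gt;
&lt;pre&gt;
```python
def overall_gain(leverage, local_gain):
    """Fraction of total run time saved by an optimization.

    leverage   -- fraction of total time spent in the code being optimized
    local_gain -- fraction by which that code's own run time is reduced
    """
    return leverage * local_gain


# Hypothetical split from the example above: 90% of time in 3-d, 10% in UI.
for name, leverage in (("3-d", 0.9), ("UI", 0.1)):
    print(name, "1% local win, overall time saved:", overall_gain(leverage, 0.01))

# A 20% local win at a leverage ratio of 0.05 saves only about 1% overall.
print("low leverage:", overall_gain(0.05, 0.20))
```
&lt;/pre&gt;
The design point is simply that the two factors multiply, so a big local win cannot rescue a small leverage ratio.&lt;br /&gt;&lt;br /&gt;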
Let's take a look at a real Shark profile to see this.&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_TrRVoYy3Itc/TPU_9rjGDfI/AAAAAAAAArY/jE9EpCGtzAQ/s1600/Picture%2B53.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 159px;" src="http://4.bp.blogspot.com/_TrRVoYy3Itc/TPU_9rjGDfI/AAAAAAAAArY/jE9EpCGtzAQ/s200/Picture%2B53.png" alt="" id="BLOGGER_PHOTO_ID_5545408845071126002" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;This is a Shark profile of X-Plane 9.62 on my Mac Pro in an external view with pretty much the default settings. I have used Shark's data mining features to collapse and clean the display.  Basically, time spent in the sub-parts of the OpenGL driver has been merged into libGL.dylib, and libSys is merged into whoever calls it.  We might not do this for other types of optimizations, but here we just want to see "who is expensive" and whether it's us or the GL.&lt;br /&gt;&lt;br /&gt;The profile is "Timed Profile (All Thread States)" for just the app.  This captures time spent blocking.  Since X-Plane's frame-rate (which is what we want to make fast) is limited by how long it takes to go around the loop including blocking, we have to look at blocking and focus on the main thread.  If we were totally CPU bound (e.g. other worker threads were using more than total available CPU resources) we might simply look at CPU use for all non-blocking threads.  But that's another profile.&lt;br /&gt;&lt;br /&gt;What do we see?  In the bottom-up view we can see our top offenders, and how much leverage we get if we can improve them.  In order of maximum leverage:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;glDrawElements!  With 35.6% of thread time (that is, a leverage ratio of 0.356) the biggest single thing we could do to make X-Plane faster is to make glDrawElements execute faster.  
0.356 is a huge number for an app that's already been beaten to within an inch of its life with Shark for a few years.&lt;/p&gt;&lt;p&gt;This isn't the first time I have Sharked X-Plane, so I can tell you a little bit of the back story.  This Shark profile is on a GeForce 8800 running OS X 10.5.8; a lot of the time is OpenGL state sync in the driver (that is, the CPU preparing to send instructions to the card to change how it renders), and some is spent pushing vertex data in a less-than-efficient manner.  This number is a lot smaller on 10.6 or an ATI card.&lt;/p&gt;&lt;p&gt;While this call isn't in our library, we still have to ask: can we make it faster?  The answer, as it turns out, is yes.  When you look at what glDrawElements is doing (by removing the data mining) you'll see most of its time is spent in gldUpdateDispatch.  This is the internal call to resynchronize OpenGL state.  So it's really us changing OpenGL state that causes most of the time spent in glDrawElements.  If we can find a way to change less state, we get a win.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Our quad-tree comes up next with 12.9%.  That's still a big number for optimization.  If we could make our quad tree faster, we might notice.&lt;/p&gt;&lt;p&gt;Once again, I have looked at this issue before; in the case of the quad tree the problem is L2 cache misses - that is, the quad tree is slow because the CPU keeps waiting to get pieces of the quad tree from main memory.  If we could change the allocation pattern or node structure of the quad tree to have better locality, we might get a win.&lt;/p&gt;&lt;p&gt;(How do we know it's an L2 cache issue?  Shark will let you profile by L2 cache misses instead of time.  If a hot spot comes up with &lt;span style="font-style: italic;"&gt;both&lt;/span&gt; L2 and time, it indicates that L2 misses might be the problem.  
If a hot spot comes up with time but not L2 misses, it means you're not missing cache; something else is making you slow.)&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Third on the list is a real surprise - this routine tends the 3-d meshes and sees if anything needs processing.  To be blunt, this stat is a surprise and almost certainly a bug, and I did a double-take when I saw it.  More investigation is needed; while 7.6% isn't the biggest performance item, this is an operation that shouldn't even be on the list, so there might be a case where fixing one stupid bug gets us a nice win.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;plot_group_layers is doing the main per-state-change drawing - it's not a huge win to optimize because the leverage isn't very high and the algorithm is already pretty optimal - that is, real work is being done here.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Mipmap generation is done by the driver, but if we can use some textures without auto-mipmap generation, it might be worth it.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;glMapBufferARB is an example of why we need to use "all thread states" - this routine can block, and when it does, we want to see why our fps is low - because our rendering thread is getting nothing done.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;glBegin is down at 2% (leverage ratio 0.02), and this is a good example of cost-benefit trade-offs.  X-Plane still has some old legacy glBegin/glEnd drawing code.  That code is old and nasty and could certainly be made faster with modern batched drawing calls.  But look at the leverage: 0.02.  That is, if we were able to improve &lt;span style="font-style: italic;"&gt;every single case of glBegin&lt;/span&gt; by a huge factor (imagine we made it 99% faster!!) we'd still see only a 2% frame-rate increase.&lt;/p&gt;&lt;p&gt;Now 2% is better than not having 2%, but it's the quantity of code that's the issue.  
We'd have to fix &lt;span style="font-style: italic;"&gt;every&lt;/span&gt; glBegin to get that win, and the code might not even be that much faster.  Because the code is so spread out and the leverage is low, we let it slide.  (Over the long term glBegin code will be replaced, but we're not going to stop working on real features to fix this now.)&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;The profile also shows a top-down view.  This view gives you a strategic view of where the overall leverage might be.  We're spending 77% of our time in scenery.  Right there we can say: most of the leverage is in the scenery engine - a lot more than might be in the flight model.  (In fact, the entire flight model is only 5.7%.)  Most of this time is then in the DSF.  The airplane shows up (9.6%) but within it, the vast majority is the OBJs attached to the airplane.&lt;br /&gt;&lt;br /&gt;In fact, if you look at the two profiles together, the low level leverage and strategic view start to make sense.  Most of that glDrawElements call is due to OBJs drawing, so it's no surprise that the airplane shows up because it has OBJs in it.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Repeatability&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;So the value of an optimization is only as good as its leverage, right?  Well, not quite.  What if one of my shaders can be optimized by 50%, but it's one of ten shaders?  Well, that optimization could be worth the full 50% if I &lt;span style="font-style: italic;"&gt;repeat&lt;/span&gt; the optimization on the other shaders.&lt;br /&gt;&lt;br /&gt;In other words, if you can keep applying a trick over and over, you can start to build up real improvement even when the leverage is low.  Applying an optimization to multiple code sites is a trivial example; more typically this would be a process.  If I can spend a few hours and get a 1% improvement in shader code, that's not huge.  
But if I can do that every day for a week, that might turn into a 10% win.&lt;br /&gt;&lt;br /&gt;So to answer the question "is 1% a lot", the answer is: yes if the leverage is there and you're going to keep beating on the code over and over.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-5402262706630231303?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/5402262706630231303/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/is-1-lot.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5402262706630231303'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5402262706630231303'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/is-1-lot.html' title='Is 1% A Lot?'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_TrRVoYy3Itc/TPU_9rjGDfI/AAAAAAAAArY/jE9EpCGtzAQ/s72-c/Picture%2B53.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-5905710963988724640</id><published>2010-11-29T21:22:00.002-05:00</published><updated>2010-11-29T21:38:06.581-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>Basis Projection</title><content type='html'>In my &lt;a 
href="http://hacksoflife.blogspot.com/2010/11/change-of-basis-revisited.html"&gt;previous post&lt;/a&gt; I described a transform matrix as a sort of kit of "premade parts" that each vertex's components call out.  Thus a vertex of (1, 0.5, 0) means "use 1 measure of the X basis from the matrix, 0.5 measures of the Y basis, and skip the Z basis, we don't want any of that."  You can almost see a matrix as a form of encoding or compression, where we decode using the "premade parts" that are basis vectors.  (I must admit, this is not a very good form of compression, as we don't save any memory.)&lt;br /&gt;&lt;br /&gt;So if our model's vertices are encoded (into "model space") and a model positioning matrix decodes our model into world space (so we can show our model in lots of places in the world), how do we &lt;span style="font-style: italic;"&gt;encode&lt;/span&gt; that model?  If we have one house in world space, how do we encode it into its own model space?&lt;br /&gt;&lt;br /&gt;One way to do so would be "basis projection".  We could take the dot product of each vector in our model* with each basis vector, and that ratio of correlation would tell us how much of the encoding vector we want.  What would a matrix for this encoding look like?&lt;br /&gt;&lt;blockquote&gt;&lt;code&gt;[x][Xx Xy Xz]&lt;br /&gt;[y][Yx Yy Yz]&lt;br /&gt;[z][Zx Zy Zz]&lt;/code&gt;&lt;/blockquote&gt;where we are encoding into the vectors X, Y, and Z, described in world coordinates.&lt;br /&gt;&lt;br /&gt;So we have put our basis vectors for our model into the rows of the matrix to encode, while last time we put them into the columns to decode.&lt;br /&gt;&lt;br /&gt;Wait - this isn't surprising.  
The encode and decode matrices are inverses and the inverse of an orthogonal matrix is its transpose.&lt;br /&gt;&lt;br /&gt;Putting this together, we can say that our orthogonal matrix has the old basis in the new coordinate system in its columns at the same time as it has the new basis in the old coordinate system in its rows.  We can view our matrix through either the lens of decoding (via columns - each column is a "premade part") or encoding (via rows - how closely do we fit that "premade part").&lt;br /&gt;&lt;br /&gt;Why bring this up? The previous post was meant  to serve two purposes:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;To justify some particularly sparse implementations of camera operations.  (E.g. how did we get the code down to so few operations, why is this legal?)&lt;/li&gt;&lt;li&gt;To try to illustrate the connection between the symbols and geometric meaning of matrices and vectors.&lt;/li&gt;&lt;/ol&gt;The observation that basis projection and application are two sides of the same coin is a segue into my next post, which will be on the core foundation of spherical harmonics, which is basically the same basis trick, except this time we are going to get compression - a lot of it.&lt;br /&gt;&lt;br /&gt;* As in the previous article, we can think of our model as a series of vectors from the origin to the vertices in our mesh.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-5905710963988724640?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/5905710963988724640/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/basis-projection.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5905710963988724640'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5905710963988724640'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/basis-projection.html' title='Basis Projection'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-5012153151989887007</id><published>2010-11-28T22:41:00.008-05:00</published><updated>2010-11-29T15:12:15.049-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>Change of Basis, Revisited</title><content type='html'>A while ago I suggested that we could find the billboard vectors (that is, the vectors aligned with the screen in model-view space) simply by looking at the matrix itself.  A commenter further pointed out that we could simply transpose the upper 3x3 of our model-view matrix to invert the rotational component of our matrix.  Let's look at these ideas again.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Transform Matrix As Basis Vectors&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If we have a 3x3 matrix T of the form:&lt;br /&gt;&lt;blockquote&gt;&lt;code&gt;[a d g]&lt;br /&gt;[b e h]&lt;br /&gt;[c f i]&lt;/code&gt;&lt;/blockquote&gt;Then when  we multiply a vector (x,y,z) by this matrix T, x gives us more or less of (a,b,c), y of (d,e,f) and z of (g,h,i).  In other words, you can think of x, y, and z being orders for premade "amounts" of 3 vectors.  In the old coordinate system, x gave us one unit of the 'x' axis, y one unit of the 'y' axis, and z one unit of the 'z' axis.  
So we can see (a,b,c) as the old coordinate system's "x" axis expressed in the new coordinate system, etc.&lt;br /&gt;&lt;br /&gt;This "change of basis" is essentially an arbitrary rotation about the origin - we are taking our model and changing where its axes are.  Use the data with the old axes and you have the model rotated.  So far everything we have done is with vectors, but you can think of a point cloud as a series of vectors from the origin to the data points.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Big Words&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A lot of math for computer programmers is just learning what the mathematicians are talking about - math has a lot of vocabulary to describe ideas.  Sometimes the words are harder than the ideas.&lt;br /&gt;&lt;br /&gt;Our rotation matrix above is a set of &lt;span style="font-style: italic;"&gt;orthonormal&lt;/span&gt; basis vectors. (We call the matrix an &lt;span style="font-style: italic;"&gt;orthogonal matrix&lt;/span&gt;.) What does that mean?  It means two things:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Each basis vector is normal - that is, its length is 1.&lt;/li&gt;&lt;li&gt;Each basis vector is orthogonal to all other basis vectors - that is, any two basis vectors are at right angles to each other.&lt;/li&gt;&lt;/ul&gt;We'll come back to these mathematical properties later, but for now, let's consider what this means for our view of a transform matrix as using "premade pieces" of the coordinate axes.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Our model isn't going to change size, because each basis vector is of length 1 (and in the original coordinate system, 1 unit is 1 unit by definition).&lt;/li&gt;&lt;li&gt;Because the axes are all orthogonal to each other, we're not going to get any squishing or dimension loss  - a cube will stay cubic.  (This would not be true if we had a projection matrix!  
Everything we say here is based on fairly limited use of the model-view matrix and will not be even remotely true for a projection matrix.)&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Translation Too&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If we want to reposition our model completely (rotate and translate) we need a translation component.  In OpenGL we do that like this:&lt;br /&gt;&lt;blockquote&gt;&lt;code&gt;[              x]&lt;br /&gt;[  R     y]&lt;br /&gt;[          z]&lt;br /&gt;[0 0 0 1]&lt;/code&gt;&lt;/blockquote&gt;In this matrix, R is our 3x3 rotation matrix, which is orthogonal, and x y z is the offset to apply to all points.  If we apply this to vectors of the form (x,y,z,1) then the math works out to first rotate the points x,y,z by R, and then add x,y,z after rotation.  The last coordinate 'w' will remain "1" for future use.  The upper left 3x3 matrix is orthogonal, but the whole matrix is not; if this were a 4-dimensional basis change, the vector (x,y,z,1) would not necessarily be normalized, nor would it be perpendicular to all other bases.&lt;br /&gt;&lt;br /&gt;There's a name for this too: an &lt;span style="font-style: italic;"&gt;affine&lt;/span&gt; transformation matrix.  If we ever have a linear transform (of which a set of orthonormal basis vectors is one) plus a translation, with 0....1 on the bottom, we have an affine transformation.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Affine Transformation As Model Position&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If we have an affine transform matrix built from three orthonormal basis vectors (our old axes in the new coordinate system) plus an offset, we have everything we need to position a model in 3-d space.  Imagine we have a house model, authored as points located around an origin.  We want to position it at many locations in our 3-d world and draw it each time.  We can build the transform matrix we want.  
Typically we'd do a sequence like:&lt;br /&gt;&lt;blockquote&gt;glTranslate(x,y,z);&lt;br /&gt;glRotate(heading,0,1,0);&lt;br /&gt;glRotate(pitch,1,0,0);&lt;br /&gt;glRotate(roll,0,0,1);&lt;/blockquote&gt;That is, first move the model's origin to this place in world space, then rotate it.  This forms an affine matrix where the rightmost column is x,y,z,1 and the left three columns are the location of the model's X, Y, and Z axes in world space (with 0 in the last component).  The bottom row is 0 0 0 1.&lt;br /&gt;&lt;br /&gt;Usually it's cheaper storage-wise to store the component parts of a model's transform than the entire transform matrix.  But if we do want to store the object in the format of a transform, we do know a few things:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The bottom row is 0 0 0 1 so we can simply delete it, cutting our matrix down from 16 floats to 12.&lt;/li&gt;&lt;li&gt;We can get the object "location" in world-space directly out of the right-hand column.&lt;/li&gt;&lt;li&gt;The upper-left 3x3 matrix contains all of the rotations.&lt;/li&gt;&lt;/ul&gt;If we have only used rotations and not mirroring, we can decode the Euler angles of the rotation from this upper left 3x3; that'll have to be another post.&lt;br /&gt;&lt;br /&gt;The main reason to store model location as a matrix (and not the original offset and rotation angles) is for hardware instancing; we can stream a buffer of 12-float matrices to the GPU and ask it to iterate over the mesh using something like &lt;a href="http://www.opengl.org/registry/specs/ARB/draw_instanced.txt"&gt;GL_ARB_draw_instanced&lt;/a&gt;.  
But if you're on an embedded platform and stuck in fixed point, replacing angles (which require trig to decode) with the matrix might also be a win.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;glScale Wasn't Invited To The Party&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;We cannot scale our model using glScale and still have an orthonormal basis; doing so with any scale values other than 1 or -1 would make the basis vectors non-unit-length, and then the basis would not be orthonormal.  We can have mirroring operations - there is no requirement that an orthonormal basis maintain the "right-handedness" of a coordinate system.&lt;br /&gt;&lt;br /&gt;With X-Plane we don't do either of these things (scale or mirror); in the first case, we don't need it, and it's a big win to know that your upper left 3x3 is orthonormal - it means you can use it to transform normal vectors directly.  (If the upper left 3x3 of your model-view scales uniformly, your normals change length, which must be fixed in the shader.  If your model-view scales non-uniformly, the direction of your normals gets skewed.)  We don't mirror because there's no need for it, and for a rendering system like OpenGL that uses triangle direction to back-face-cull, changing the coordinate system from right-handed to left-handed would require changing back-face culling to match.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Stupid Matrix Tricks&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There are a few nice mathematical properties of orthogonal matrices.  First, the transpose of the matrix is its inverse.  To demonstrate this, take an orthogonal matrix and multiply it by its transpose.  
You'll find that every component is either the dot product of two orthogonal vectors or of a unit-length vector with itself, thus it forms the 0s and 1s of an identity matrix.&lt;br /&gt;&lt;br /&gt;That's cool right there - it means we can use a fast operation (transpose) instead of a slow operation (invert) on our upper left 3x3.  It also means that the inverse of an orthogonal matrix is orthogonal too.  (Since the inverse is the transpose, you can first invert your matrix by transposing, then multiply that new matrix by its transpose, which is the original, and multiply the components out - you'll find the same identity pattern again.)&lt;br /&gt;&lt;br /&gt;If an orthogonal matrix's inverse is orthogonal, so is its transpose, and from that you can show that multiplying two orthogonal matrices forms a new orthogonal matrix - that is, batching up orthogonal matrix transforms preserves the orthogonality.  (You can calculate the components of the two matrices, and calculate the transpose from the multiplication of the transposes of the sources in opposite order.  When you manipulate the algebra, you can show that the multiplication of the two orthogonal matrices has its transpose as its inverse too.)&lt;br /&gt;&lt;br /&gt;If we know that orthogonal matrices stay orthogonal when we multiply them, we can also show that affine matrices stay affine when we multiply them.  (To do this, apply an affine matrix to another affine matrix and note that the orthogonal upper left 3x3 is only affected by the other matrix's upper left 3x3, and the bottom row stays 0 0 0 1.)  
That's handy because it means that we can pre-multiply a pile of affine transform matrices and still know that the result is affine, and that all of our tricks apply.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Camera Transforms&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Camera transforms are funny beasts: when we move the camera, we do the opposite transforms of moving the model, and we do them in the opposite order.  So while we positioned our model by first translating, then rotating, we position our camera by first rotating, then translating, and each step is the opposite (negated) transform.  (Consider: if we want to position the camera at 10,0,0 in world space, we really achieve this by moving the model to -10,0,0.)&lt;br /&gt;&lt;br /&gt;This means we can't recover information about our camera location directly from the modelview matrix.  But thanks to all of the properties above, we can make camera angle recovery cheap.&lt;br /&gt;&lt;br /&gt;Our orthogonal matrix contains the location of the old coordinate system's axes (in terms of the new coordinate system) as columns.  But since its transpose is its inverse, it also contains the location of the new coordinate system's axes (in terms of the old coordinate system) as rows.  Our model view matrix transforms from world space to camera space.  So we have the axes of the camera space (that is, the "X" axis of camera space/eye space is an axis going to the right on your monitor) in terms of world space, right there in the rows.  Thus we can use the first, second and third row of the upper left 3x3 of an affine transform matrix to know the billboard vectors of an affine modelview matrix.&lt;br /&gt;&lt;br /&gt;The location of the camera is slightly trickier.  The rightmost column isn't the negative of the position of the camera. Why not?  Well, remember our "model positioning" transform was a translate-then-rotate matrix, with the translation in the right column.  
But camera transforms happen in the opposite order (negative-rotate, then negative-translate).  So the location of the camera is already "pre-rotated".  Doh.&lt;br /&gt;&lt;br /&gt;Fortunately we have the inverse of the rotation - it's just the transpose of the orthogonal 3x3 upper left of our affine matrix.  So to restore the camera location from a model-view matrix we just need to multiply the upper right 3x1 column (the translation) by the transpose of the upper left 3x3 (the rotation) and then negate.&lt;br /&gt;&lt;br /&gt;Having the camera direction and location in terms of the modelview matrix is handy when we have code that applies a pile of nested transforms.  If we have gone from world to aircraft to engine coordinates (and the engine might be canted), we're in pretty deep.  What are the billboarding vectors?  How far from the engine are we?  Fortunately we can get them right off of the modelview matrix.&lt;br /&gt;&lt;br /&gt;EDIT: one quick note: an affine transform is still affine even if the upper left matrix isn't orthogonal; it only needs to be linear.  But if we do have an orthogonal matrix in the upper left of our affine matrix, multiplying two of these "affine-orthogonal" matrices together does preserve both the affine-ness of the whole matrix and the orthogonality of the upper left.  
I'm not sure if there is an official name for an affine matrix with an orthogonal upper left.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-5012153151989887007?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/5012153151989887007/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/change-of-basis-revisited.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5012153151989887007'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5012153151989887007'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/change-of-basis-revisited.html' title='Change of Basis, Revisited'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-4825073209966286451</id><published>2010-11-26T11:10:00.002-05:00</published><updated>2010-11-26T11:23:12.572-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='c++'/><category scheme='http://www.blogger.com/atom/ns#' term='STL'/><category scheme='http://www.blogger.com/atom/ns#' term='Rants'/><category scheme='http://www.blogger.com/atom/ns#' term='Performance'/><title type='text'>More STL Abstraction</title><content type='html'>A while ago I made the claim that the &lt;a href="http://hacksoflife.blogspot.com/2010/02/stl-is-not-abstraction.html"&gt;STL is not an 
abstraction&lt;/a&gt; because the specification it implements is so specific with regard to performance that it's really just an implementation.  Response was swift and furious, and my contrast of the STL to something like an SQL query may have been lost. &lt;br /&gt;&lt;br /&gt;Which raises the question: why is anyone reading this?  What the heck?  Chris and I have been terrified to find anything from this blog on the first page of a Google search or reposted somewhere.  So if you're reading this: there will be no refunds at the end of this article if you feel you've lost 5 minutes of your life you'll never get back.  You have been warned.&lt;br /&gt;&lt;br /&gt;Anyway, I was looking at one case of the STL where you actually don't know what kind of performance you'll get unless you know things about your specific implementation.  When you insert a range into a vector, the insert code has two choices:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Iterate the range once, incrementally inserting.  This may cause excess reallocation (since we don't know how much more memory we need until we're done), and that reallocation may in turn cause excess copy construction and destruction.&lt;/li&gt;&lt;li&gt;Measure the distance of the range.  This lets us do one allocation and a minimum number of copies, but if the iterator doesn't have random access differencing, it would mean the range is iterated twice.&lt;/li&gt;&lt;/ol&gt;The GCC 4 STL that we use will only pick choice 2 if the iterator is known to be random access.  (I haven't scrubbed the traits system to see how good it is at determining this when iterators don't use appropriate tags.)  Whether this is a win or not can't be known by the STL, as it doesn't know the relative cost of iteration vs. object copy vs. memory allocation.&lt;br /&gt;&lt;br /&gt;I have seen other cases where you can't quite guess what STL performance might be. 
For example, some implementations cache list size for O(1) list::size() while others do not cache and have to actually traverse the list.  The SGI STL documentation does declare what the worst behavior is, so I have no right to complain if the list size isn't cached.&lt;br /&gt;&lt;br /&gt;My argument isn't that the STL should always do the right thing by reading my mind.  My argument is that because the STL is such a low level of abstraction, and because it serves such a low level purpose in code, the performance of the implementation matters.  There may not be one right container for the job, and in trying to decide between a vector and list, whether I get single-allocate-insert on vector or constant-time size on the list might matter.&lt;br /&gt;&lt;br /&gt;Fortunately in real development this turns out to be moot; if performance matters, we'll run an adaptive sampling profiler like Shark on the app, which will tell us whether for our particular usage and data the STL is under-performing.  
In a number of cases, our solution has been to toss out the STL entirely for something more efficient; as long as that's on the table we're going to have to profile, which will catch STL implementation differences too.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-4825073209966286451?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/4825073209966286451/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/more-stl-abstraction.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/4825073209966286451'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/4825073209966286451'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/more-stl-abstraction.html' title='More STL Abstraction'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-7268261190704414734</id><published>2010-11-23T14:22:00.007-05:00</published><updated>2010-11-23T14:32:47.783-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>The Value Of Gamma Compression</title><content type='html'>A number of engineers all had the same response to the sRGB/Gamma thread on the Mac OpenGL list: life would be a lot easier if color were linear.  Yes, it would be easier.  
But would it be beautiful?&lt;br /&gt;&lt;br /&gt;The answer is: not at 8 bits, and definitely not with DXT compression.&lt;br /&gt;&lt;br /&gt;The following images show a gray-scale bar, quantized to: 16, 8, 6, and 5 bits per channel.  (16-bits per channel would be typical of a floating point, HDR, or art asset pipeline, 8-bits is what most apps will have to run on the GPU, and 5/6 bits simulate the banding in the key colors of DXT-compressed textures, which are 5-6-5.)&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_TrRVoYy3Itc/TOwVNUG2HaI/AAAAAAAAAqw/5uaID-YSobQ/s1600/srgb_5.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 150px;" src="http://2.bp.blogspot.com/_TrRVoYy3Itc/TOwVNUG2HaI/AAAAAAAAAqw/5uaID-YSobQ/s200/srgb_5.png" alt="" id="BLOGGER_PHOTO_ID_5542828559866142114" border="0" /&gt;&lt;/a&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_TrRVoYy3Itc/TOwVMStZ5eI/AAAAAAAAAqo/XMrYf3IYnC8/s1600/srgb_6.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 150px;" src="http://1.bp.blogspot.com/_TrRVoYy3Itc/TOwVMStZ5eI/AAAAAAAAAqo/XMrYf3IYnC8/s200/srgb_6.png" alt="" id="BLOGGER_PHOTO_ID_5542828542311130594" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_TrRVoYy3Itc/TOwVLQLfchI/AAAAAAAAAqg/Vpag9vx3NoM/s1600/srgb_8.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 150px;" src="http://3.bp.blogspot.com/_TrRVoYy3Itc/TOwVLQLfchI/AAAAAAAAAqg/Vpag9vx3NoM/s200/srgb_8.png" alt="" id="BLOGGER_PHOTO_ID_5542828524452147730" border="0" /&gt;&lt;/a&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_TrRVoYy3Itc/TOwVK5LZ66I/AAAAAAAAAqY/dw9oM-SiZCM/s1600/srgb_16.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 150px;" 
src="http://3.bp.blogspot.com/_TrRVoYy3Itc/TOwVK5LZ66I/AAAAAAAAAqY/dw9oM-SiZCM/s200/srgb_16.png" alt="" id="BLOGGER_PHOTO_ID_5542828518277770146" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;In the images labeled "srgb" (gamma is 1.0) the colors are quantized in sRGB (non-linear) space.  Because sRGB is perceptually even, the banding &lt;span style="font-style: italic;"&gt;appears&lt;/span&gt; to be even to a human - it's a good use of our limited bits.  8-bit color is pretty much smooth, and artifacts are minimized for 5 and 6 bits (although we can definitely see some banding here).&lt;br /&gt;&lt;br /&gt;Now what happens if we quantize in &lt;span style="font-style: italic;"&gt;linear&lt;/span&gt; space?  You'd get this:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_TrRVoYy3Itc/TOwVzon-jvI/AAAAAAAAArQ/LAqw8e_4JAM/s1600/linear_5.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 150px;" src="http://3.bp.blogspot.com/_TrRVoYy3Itc/TOwVzon-jvI/AAAAAAAAArQ/LAqw8e_4JAM/s200/linear_5.png" alt="" id="BLOGGER_PHOTO_ID_5542829218208845554" border="0" /&gt;&lt;/a&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_TrRVoYy3Itc/TOwVzKeeytI/AAAAAAAAArI/NJfusQex6PE/s1600/linear_6.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 150px;" src="http://1.bp.blogspot.com/_TrRVoYy3Itc/TOwVzKeeytI/AAAAAAAAArI/NJfusQex6PE/s200/linear_6.png" alt="" id="BLOGGER_PHOTO_ID_5542829210115951314" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_TrRVoYy3Itc/TOwVydJ9W5I/AAAAAAAAArA/CkvF23TSAX8/s1600/linear_8.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 150px;" src="http://1.bp.blogspot.com/_TrRVoYy3Itc/TOwVydJ9W5I/AAAAAAAAArA/CkvF23TSAX8/s200/linear_8.png" alt="" id="BLOGGER_PHOTO_ID_5542829197950278546" 
border="0" /&gt;&lt;/a&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_TrRVoYy3Itc/TOwVxZvM3yI/AAAAAAAAAq4/pqSh5kJM3Ng/s1600/linear_16.png"&gt;&lt;img style="cursor: pointer; width: 200px; height: 150px;" src="http://4.bp.blogspot.com/_TrRVoYy3Itc/TOwVxZvM3yI/AAAAAAAAAq4/pqSh5kJM3Ng/s200/linear_16.png" alt="" id="BLOGGER_PHOTO_ID_5542829179852873506" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Note: the program generates these ramps in sRGB space (hence they are "evenly spaced"), converts to linear, quantizes, then converts back.  So this is what your textures would look like if your art assets were converted to and stored linearly.&lt;br /&gt;&lt;br /&gt;What can we see?  Well, if we have 16 bits per channel we're still okay.  But at 8 bits (the normal way to send an uncompressed texture to the GPU) we have visible banding in the darker regions.  This is because linear isn't an efficient way to space out limited bits for our eyes.&lt;br /&gt;&lt;br /&gt;The situation is &lt;span style="font-style: italic;"&gt;really&lt;/span&gt; bad for the 6 and 5-bit compressed textures; we have so little bandwidth that the entire dark side of the spectrum is horribly quantized.&lt;br /&gt;&lt;br /&gt;The moral of the story (if there is one): gamma is your friend - it's non-linear, which is annoying for lighting shaders, but when you have 8 bits or fewer, it puts the bits where you need them.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-7268261190704414734?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/7268261190704414734/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/value-of-gamma-compression.html#comment-form' title='0 Comments'/><link 
rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7268261190704414734'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7268261190704414734'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/value-of-gamma-compression.html' title='The Value Of Gamma Compression'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_TrRVoYy3Itc/TOwVNUG2HaI/AAAAAAAAAqw/5uaID-YSobQ/s72-c/srgb_5.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-5277324452313062504</id><published>2010-11-22T20:30:00.003-05:00</published><updated>2010-11-22T20:31:34.036-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='C'/><category scheme='http://www.blogger.com/atom/ns#' term='Rants'/><title type='text'>I hate C, part 492.</title><content type='html'>FMTT.&lt;div&gt;&lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo"&gt;HIWindowCopyColorSpace_f = (&lt;span style="color: #703daa"&gt;CGColorSpaceRef&lt;/span&gt; (*)(&lt;span style="color: #703daa"&gt;WindowRef&lt;/span&gt;)) &lt;/p&gt;&lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt; &lt;/span&gt;CFBundleGetFunctionPointerForName, tib, &lt;span style="color: #78492a"&gt;CFSTR&lt;/span&gt;(&lt;span style="color: #d12f1b"&gt;"_HIWindowCopyColorSpace"&lt;/span&gt;);&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo"&gt;HIWindowSetColorSpace_f = (&lt;span style="color: 
#703daa"&gt;OSStatus&lt;/span&gt; (*)(&lt;span style="color: #703daa"&gt;WindowRef&lt;/span&gt;,&lt;span style="color: #703daa"&gt;CGColorSpaceRef&lt;/span&gt;))&lt;/p&gt;&lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt; &lt;/span&gt;CFBundleGetFunctionPointerForName, tib, &lt;span style="color: #78492a"&gt;CFSTR&lt;/span&gt;(&lt;span style="color: #d12f1b"&gt;"_HIWindowSetColorSpace"&lt;/span&gt;);&lt;/p&gt;&lt;/div&gt;&lt;div&gt;The set of legal C programs is just not that different from the set of ASCII strings.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-5277324452313062504?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/5277324452313062504/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/i-hate-c-part-492.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5277324452313062504'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5277324452313062504'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/i-hate-c-part-492.html' title='I hate C, part 492.'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-8588366727120888608</id><published>2010-11-22T18:50:00.004-05:00</published><updated>2010-11-26T11:10:19.815-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>Gamma and Lighting Part 3: Errata</title><content type='html'>A few other random notes for working with lighting in &lt;a href="http://hacksoflife.blogspot.com/2010/11/gamma-and-lighting-part-2-working-in.html"&gt;linear space&lt;/a&gt;...&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Light Accumulation&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In order to get correct lighting, lights need to be accumulated (that is, the contribution of each light to any given pixel needs to be summed) in linear space.  There are three ways to do this safely:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Accumulate all lights in a single pass.  The sum happens in your shader (which must calculate each light's contribution and add them).  This is typical of a traditional one-pass forward renderer, and is straightforward to convert to linear space.  The lighting calculation is done linearly in shader, converted to something with gamma, then written to an 8-bit framebuffer.&lt;/p&gt;&lt;p&gt;Clamping and exposure control are up to you and can happen in-shader while you are in floating point.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Multi-pass with framebuffer_sRGB.  If you want to accumulate lights by blending (e.g. using blending to add more "light" into the framebuffer) and your framebuffer is 24-bit RGB, you'll need the framebuffer_sRGB extension.  With it, OpenGL not only accepts linear fragments from you; the blending addition also happens in linear space.&lt;/p&gt;&lt;p&gt;In this case, exposure control is tricky; you don't want to saturate your framebuffer, but no one lighting pass knows the total exposure.  
You'll have to set your exposure so that the conversion to sRGB doesn't "clip".&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Multi-pass with an HDR framebuffer.  The other option for accumulating light by blending is to use a floating point framebuffer.   This configuration (like multi-pass with sRGB) might be typical of a deferred renderer (or a stencil-shadow-volume solution).&lt;/p&gt;&lt;p&gt;Unlike sRGB into a 24-bit framebuffer, this case doesn't have a clipping problem, because you can write linear values into the floating point framebuffer.  You do need to "mix down" the floating point framebuffer back to sRGB later.&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;What you cannot do is convert back to sRGB in your shader and then use regular blending.  Doing so will add lights in sRGB space, which is incorrect and will lead to blown out or clipped lights wherever they overlap.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;What Color Space Is The Mac Framebuffer?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;On OS X 10.5 and older, your framebuffer will have the color profile of the device, at least as far as I can tell.  This probably means that the effective gamma is 1.8 (since the OS's color management should adjust the LUT enough to achieve this net result).&lt;br /&gt;&lt;br /&gt;On OS X 10.6 every window has its own color profile, and the window manager color-converts as needed to get correct output on multiple devices. 
&lt;span style="font-weight: bold;"&gt;Edit:&lt;/span&gt; By default you get the monitor's default  color profile, but you can use HIWindowSetColorSpace to change what color profile you work in.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-8588366727120888608?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/8588366727120888608/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/gamma-and-lighting-part-3-errata.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8588366727120888608'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8588366727120888608'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/gamma-and-lighting-part-3-errata.html' title='Gamma and Lighting Part 3: Errata'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-1850197716359023086</id><published>2010-11-22T17:16:00.003-05:00</published><updated>2010-11-22T17:32:51.972-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>Gamma and Lighting Part 2: Working in Linear Space</title><content type='html'>In my &lt;a href="http://hacksoflife.blogspot.com/2010/11/gamma-and-lighting-part-1-color-sync.html"&gt;previous post&lt;/a&gt; I tried to describe the process of maintaining color 
sync.  Two important things to note:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Most color spaces that are reasonable for 24-bit framebuffers aren't linear.  Twice the RGB means a lot more than twice the luminance.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;This is good from a data-size standpoint, because 8 bits per channel isn't enough to be linear.&lt;/li&gt;&lt;/ul&gt;But there's a problem: this non-linear encoding is not good if we're going to &lt;a href="http://http.developer.nvidia.com/GPUGems3/gpugems3_ch24.html"&gt;perform 3-d calculations to create computer-generated images&lt;/a&gt; of light sources.  Note that this is not an issue of monitor calibration or which gamma curve (Mac or PC) you use; no color space with any gamma is going to be even remotely close to linear luminance.  So this is almost a 'wrong format' issue and not a 'wrong calibration' issue.&lt;br /&gt;&lt;br /&gt;Consider: light is additive - if we add more photons, we get more light.  This is at the heart of a computer graphics lighting model, where we sum the contribution of several lights to come up with a luminance for an RGB pixel.  But remember the math from the previous post: doubling the RGB value more than doubles the luminance from your monitor.&lt;br /&gt;&lt;br /&gt;In order to correctly create lighting effects, we need to:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Convert from sRGB to linear color.&lt;/li&gt;&lt;li&gt;Do the lighting accumulation in linear color space.&lt;/li&gt;&lt;li&gt;Convert back to sRGB because that's the format the framebuffer needs.&lt;/li&gt;&lt;/ol&gt;Doing this makes a huge difference in the quality of lighting.  When physical lighting calculations are done directly in sRGB space, intermediate light levels are too dark (cutting the sRGB value in half cuts the luminance by a factor of five!) and additive effects become super-bright in their center.  
I found that I can also set ambient lighting to be lower when using correct linear lighting because the intermediate colors aren't so dark.  (With intermediate colors dark, you have to turn up ambience or the whole image will be dark.)&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Let the GPU Do It&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The OpenGL extensions &lt;a href="http://www.opengl.org/registry/specs/EXT/texture_sRGB.txt"&gt;GL_EXT_texture_sRGB&lt;/a&gt; and &lt;a href="http://www.opengl.org/registry/specs/ARB/framebuffer_sRGB.txt"&gt;GL_ARB_framebuffer_sRGB&lt;/a&gt; basically do steps 1 and 3 for you; when you set a texture's internal format to sRGB, the GPU converts from sRGB to linear space during texel fetch.  When framebuffer_sRGB is enabled, the GPU converts from linear back to sRGB before writing your fragment out to the framebuffer.  Thus your shader runs in linear space (which is fine because it has floating point precision) while your textures and framebuffer are sRGB like they've always been.*&lt;br /&gt;&lt;br /&gt;The advantage of using these extensions on DirectX 10 hardware is that the conversion happens before texture filtering and after framebuffer blending - two operations you couldn't "fix" manually in your shader.  So you get linear blending too, which makes the blend of colors look correct.&lt;br /&gt;&lt;br /&gt;Of course, your internal art asset format has to be sRGB in order for this to work, because it's the only color space the GL will convert from and back to.&lt;br /&gt;&lt;br /&gt;* The question of whether your framebuffer is sRGB or linear is really more a question of naming convention.  If you go back 10 years, you know a few things: writing RGB values into the framebuffer probably produces color on the monitor that is close to what you'd expect from sRGB, but the GL does all lighting math linearly.  
So it's really sRGB data being pushed through a linear pipeline, which is wrong and the source of lighting artifacts.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-1850197716359023086?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/1850197716359023086/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/gamma-and-lighting-part-2-working-in.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/1850197716359023086'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/1850197716359023086'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/gamma-and-lighting-part-2-working-in.html' title='Gamma and Lighting Part 2: Working in Linear Space'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-529909192781751111</id><published>2010-11-22T16:39:00.004-05:00</published><updated>2010-11-22T17:33:26.777-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Macintosh'/><category scheme='http://www.blogger.com/atom/ns#' term='Windows'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>Gamma and Lighting Part 1: Color Sync</title><content type='html'>The topic of color spaces, gamma, and color correction gets complex and muddled quickly enough that I had to 
delete my post and start over.  Fortunately a number of folks on the Mac OpenGL list set me straight.  This first post will discuss how to control color when working with textures; the second one will address how to light those textures in 3-d.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.poynton.com/PDFs/GammaFAQ.pdf"&gt;This FAQ&lt;/a&gt; explains Gamma infinitely better than I possibly can.  I suggest reading it about eight times.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;24-bit Color Spaces Are Pretty Much Never Linear&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;A color space defines the meaning of RGB colors (that is, triplets of 8-bit or more codes).  How red is red?  That's what your color space tells you. And the first thing I must point out is: no color space you'd ever use on an 8-bit-per-channel (24-bit color) display is ever going to have a linear mapping between RGB values and luminance.  Why not?  This &lt;a href="http://en.wikipedia.org/wiki/SRGB"&gt;Wikipedia&lt;/a&gt; article on sRGB puts it best:&lt;br /&gt;&lt;blockquote&gt;This nonlinear conversion means that sRGB is a reasonably efficient use  of the values in an integer-based image file to display  human-discernible light levels.&lt;/blockquote&gt;Here's what's going on:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Our perception of color is non-linear.  It's roughly logarithmic.  This means we're more discerning of light level changes at low light levels.  That's good - we can see in the dark, sort of, or at least  this improves our dynamic range of light perception.&lt;/li&gt;&lt;li&gt;256 levels of gray isn't a lot, so it's important that the levels be spread out evenly in terms of what humans perceive.  
Otherwise we'd run out of distinct gray levels and see banding.&lt;/li&gt;&lt;li&gt;Therefore, to make 24-bit images look good, pretty much every color space including sRGB is radically different from linear &lt;a href="http://en.wikipedia.org/wiki/Luminance"&gt;luminance&lt;/a&gt; levels.&lt;/li&gt;&lt;/ul&gt;When we discuss images with different gamma below, let's just keep this in the back of our mind: pretty much any sane color space is going to have &lt;span style="font-style: italic;"&gt;some&lt;/span&gt; gamma if it will be used for 24-bit images, and that's good because it lets us avoid banding and other artifacts.  It will also become important when we get to lighting.&lt;br /&gt;&lt;br /&gt;Part of the confusion over gamma correction is that CRTs have non-linear response by nature of their electronics.  People assume that this is bad and that gamma correction curves make the RGB-&gt;luminance response linear.  In fact, that response is actually useful, and that's why the gamma correction curves on a computer typically undo &lt;span style="font-style: italic;"&gt;some&lt;/span&gt; of the non-linear response.  For example, Macs used to bring the gamma curve down to 1.8 (from 2.5 for a CRT) but not all the way down to 1.0.  As mentioned above, if we really had linear response between our RGB colors and luminance, we would need more than 8 bits per channel.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Consistent Color&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If we are going to develop a simulator that renders using OpenGL onto a PC or Mac, we're going to work in a color space, probably with 24-bit color most of the time.  (We might use higher bit depths in advanced rendering modes, but we can't afford to store our entire texture set in a higher res.  Heck, it's expensive to even turn off texture compression!)&lt;br /&gt;&lt;br /&gt;So the important thing is that color is maintained consistently through the authoring pipeline to the end user.  
In practice that means:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Artists need to have their monitors correctly calibrated.  If the artist's monitor isn't showing enough red, the artist paints everything to be too red, and the final results look Mars-tacular.  So the artist needs good "monitoring."&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The color space needs to be tracked through the entire production process.  Whether this happens by simply declaring a global color space for the entire pipeline or by tagging assets, we need to make sure that red keeps meaning the same red, or that if we change it, we record that we've changed it and correctly adjust the image.&lt;/p&gt;&lt;p&gt;When working with PNG, we try to record the gamma curve on the PNG file.  We encourage our artists to work entirely in sRGB.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The graphics engine needs to not trash the color on read-in.  Similar to the pipeline, we need to either not mess up the colors, or track any change in color space.  In X-Plane's case, X-Plane will try to run in approximately the color space of the end user's machine, so X-Plane's texture loader will change the color space of textures at load time.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The graphics engine needs to cope with a user whose display is atypical (or not).  Finally, if the user's display is not in the same color space as the production pipeline, the engine needs to adjust its output.  (Or not adjust its output and expect the user to fix it locally.)  In X-Plane's case we do this by using the user's machine's gamma curve and correcting images at load time.&lt;/p&gt;&lt;p&gt;This is not the best solution for color accuracy; it would be better to keep the images in their saved color space and convert color in-shader (where we have floating point instead of 8 bits per channel).  Historically X-Plane does the cheap conversion to save fill rate.  
We may some day change this.&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;(Note: pre-converting color is particularly a poor idea for DXT-compressed textures, because you can adjust the pair of 565 key colors in each block but you can't adjust the weightings, which are clamped at fixed mix-points based on the DXT specification.  So the mid-tone colors for each block will be distorted by trying to color correct a compressed image.)&lt;br /&gt;&lt;br /&gt;If all of these steps are taken, the end user should see textures just as the art team authored them.  Before we had step four in X-Plane, PC users used to report that X-Plane was "too dark" because their monitor's color response wasn't the same as our art team's (who were using Macs).  Now they report that X-Plane is too dark because it's dark. :-)&lt;br /&gt;&lt;br /&gt;In my &lt;a href="http://hacksoflife.blogspot.com/2010/11/gamma-and-lighting-part-2-working-in.html"&gt;next post&lt;/a&gt; I'll discuss the relationship between non-linear color spaces (that work well in 24-bit color) and 3-d lighting models.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-529909192781751111?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/529909192781751111/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/gamma-and-lighting-part-1-color-sync.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/529909192781751111'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/529909192781751111'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/gamma-and-lighting-part-1-color-sync.html' 
title='Gamma and Lighting Part 1: Color Sync'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-8404512233431248361</id><published>2010-11-11T09:58:00.002-05:00</published><updated>2010-11-11T10:19:34.599-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='C'/><category scheme='http://www.blogger.com/atom/ns#' term='Rants'/><title type='text'>The Very Best Seventies Technology</title><content type='html'>I do not like Lisp.  Actually, that's not quite correct.  I despise Lisp the way Red Sox fans despise the Yankees and Glenn Beck despises fiat money.  (BTW there has &lt;span style="font-style: italic;"&gt;never&lt;/span&gt; been a better time to buy gold.  &lt;a href="http://www.npr.org/blogs/money/2010/10/14/130575234/we-bought-gold"&gt;Srsly&lt;/a&gt;.)  Lisp is not a computer language; it is a cult that warps the minds of otherwise capable programmers and twists their very notion of what a program is.*&lt;br /&gt;&lt;br /&gt;So it is in this context that I began a conscious campaign to reduce the number of parentheses in my C++.  I know, some of you may be saying "&lt;a href="http://www.aristeia.com/books.html"&gt;say what you mean, understand what you say&lt;/a&gt;" and all of that feel-good mumbo jumbo.  Heck, my own brother told me not to minimize parens.  But I can't have my carefully crafted C++ looking like Lisp.  It just won't do.&lt;br /&gt;&lt;br /&gt;So slowly I started to pay attention to operator precedence, to see when I didn't actually need all of those "safety" parens.  And here's what I found: 95% of the time, the C operator precedence makes the easy and obvious expression the default.  
I was actually surprised by this, because on the face of it the &lt;a href="http://www.swansontec.com/sopc.html"&gt;order looks pretty arbitrary&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;If you squint though, you'll see a few useful groups:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Unary operators before binary operators.&lt;/li&gt;&lt;li&gt;Math before comparison.&lt;/li&gt;&lt;li&gt;Comparison before anything if-like (e.g. &amp;amp;&amp;amp; which is more like control flow than an operator).&lt;/li&gt;&lt;li&gt;Assignment at the bottom.&lt;/li&gt;&lt;/ul&gt;There's just one rub: comparison is higher precedence than bit-wise binary operators, which is to say:&lt;br /&gt;&lt;blockquote&gt;if( value &amp;amp; mask == flag)&lt;/blockquote&gt;doesn't do what you want.  You have to write the more annoying:&lt;br /&gt;&lt;blockquote&gt;if((value &amp;amp; mask) == flag)&lt;/blockquote&gt;So what went wrong?  It turns out there's a &lt;a href="http://cm.bell-labs.com/cm/cs/who/dmr/chist.html"&gt;reason&lt;/a&gt;!&lt;br /&gt;&lt;br /&gt;Approximately 5,318 years ago when compilers were made out of yarn and a byte only had five bits, C was being built within the context of B and BCPL.  If you thought C was cryptic and obtuse, you should see &lt;a href="http://en.wikipedia.org/wiki/B_%28programming_language%29#Example"&gt;B&lt;/a&gt; and &lt;a href="http://en.wikipedia.org/wiki/BCPL#Examples"&gt;BCPL&lt;/a&gt;.  B is like C if you removed anything that might tell you what the hell is actually going on, and BCPL looks like you took C, put it in a blender with about 3 or 4 other languages, and played "&lt;a href="http://www.willitblend.com/%27"&gt;will it blend&lt;/a&gt;".  (Since BCPL is "no longer in common use", apparently the answer is no.  
But I guess they couldn't have thought it looked like a C blend at the time, as C hadn't been invented.)&lt;br /&gt;&lt;br /&gt;Anyway, in B and BCPL, &amp;amp; and | had a sort of magic property: inside an if statement they used lazy evaluation (like &amp;amp;&amp;amp; and || in C/C++) - they wouldn't even bother with the second operand if the first already determined the result.  So you could write things like this: if(ptr &amp;amp; ptr-&gt;value) safely.  But you could also write flag = ptr &amp;amp; 1; to extract the low bit.&lt;br /&gt;&lt;br /&gt;In a rare moment of preferring sanity over voodoo, Dennis Ritchie chose to split &amp;amp; and | into two operators each: | and &amp;amp; would be bit-wise and always evaluate both operands, while &amp;amp;&amp;amp; and || would work logically and short-circuit.  But since they already had piles of code using &amp;amp; as both, they had to keep the precedence of &amp;amp; the same as in B/BCPL (that is, low precedence like &amp;amp;&amp;amp;) or go back and add parens to all existing code. &lt;br /&gt;&lt;br /&gt;So while &amp;amp; could be higher precedence, it isn't, for historical reasons.  But have patience; we've only had to live with this for 38 years.  I am sure that in another 40 or 50 years we'll clean things up a bit.&lt;br /&gt;&lt;br /&gt;* A program is a huge mess of confusing punctuation.  
Something clean and elegant like this: &lt;a href="http://simson.net/ref/ugh.pdf"&gt;for(;P("\n"),R=;P("|"))for(e=C;e=P("_"+(*u++/ 8)%2))P("|"+(*u/4)%2);&lt;br /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-8404512233431248361?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/8404512233431248361/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/very-best-seventies-technology.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8404512233431248361'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8404512233431248361'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/very-best-seventies-technology.html' title='The Very Best Seventies Technology'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-1838135947339370405</id><published>2010-11-10T15:19:00.003-05:00</published><updated>2010-11-10T15:26:26.593-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='c++'/><title type='text'>Finding Mom and Dad</title><content type='html'>I was looking to use generic programming to clean up some otherwise picky code today, but I fear I can't find a way to do it that doesn't cost me storage.  
(Well, that's not 100% true, but I digress.)&lt;br /&gt;&lt;br /&gt;There may be cases where I have an object O with sub-parts S1, S2, etc. where the implementation of S1/S2 could be made more efficient if part of their data could be pushed out to O.  An example would be an object containing several intrinsic linked lists.  The linked lists can be made faster by maintaining a stack of unused nodes, like this:&lt;br /&gt;&lt;blockquote&gt;struct node {&lt;br /&gt; node * next;&lt;br /&gt; ...&lt;br /&gt;};&lt;br /&gt;struct O {&lt;br /&gt; node * S1_list_head;&lt;br /&gt; node * S2_list_head;&lt;br /&gt; node * free_stack_top;&lt;br /&gt;};&lt;br /&gt;&lt;/blockquote&gt;Our object contains two lists, but only one common stack pool.&lt;br /&gt;&lt;br /&gt;We can use templates to safely hide away the maintenance of this code, sort of, e.g.&lt;br /&gt;&lt;blockquote&gt;template&amp;lt;class node&amp;gt;&lt;br /&gt;void pop_front(node *&amp;amp; head_list, node *&amp;amp; free_list)&lt;br /&gt;{&lt;br /&gt; node * k = head_list;&lt;br /&gt; head_list = head_list-&gt;next;&lt;br /&gt; k-&gt;next = free_list;&lt;br /&gt; free_list = k;&lt;br /&gt;}&lt;br /&gt;&lt;/blockquote&gt;There's just one problem: if we call pop_front and reverse the argument order for our head and free list, we are going to create a world of hurt.  What we really want is something like S1_list_head.pop_front();&lt;br /&gt;&lt;br /&gt;What I can't seem to find is a way to turn S1 and S2 into objects that have knowledge of the object that contains them.  In the above case, we would have to template S1 and S2 based on their byte offset from themselves to the location of their common section in the parent.  
That's something the compiler knows, but won't tell us, and we don't want to figure out by hand.&lt;br /&gt;&lt;br /&gt;The best real alternative I suppose would be to wrap the templated list function in something inside O and then make the lists private, limiting the number of cases where a usage error of the template functions can occur.&lt;br /&gt;&lt;br /&gt;The traditional work-around for this would be to include a pointer to O inside S1 and S2.  This is clean and fool-proof, but also increases the storage requirements of S1 and S2; if we are storage sensitive to O, this isn't acceptable.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-1838135947339370405?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/1838135947339370405/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/finding-mom-and-dad.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/1838135947339370405'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/1838135947339370405'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/finding-mom-and-dad.html' title='Finding Mom and Dad'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-3704308050360543926</id><published>2010-11-10T12:45:00.002-05:00</published><updated>2010-11-10T12:59:25.315-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='MediaWiki'/><title type='text'>MediaWiki and ModSecurity</title><content type='html'>We were seeing 500 Internal Server Errors:&lt;br /&gt;&lt;blockquote&gt;The server encountered an internal error or misconfiguration and was unable to complete your request.&lt;br /&gt;Please contact the server administrator, webmaster@xsquawkbox.net and inform them of the time the error occurred, and anything you might have done that may have caused the error.&lt;br /&gt;More information about this error may be available in the server error log.&lt;br /&gt;&lt;/blockquote&gt;My host finally figured out what was going wrong.  This was in /www/logs/error_log:&lt;br /&gt;&lt;blockquote&gt;[Wed Nov 10 12:48:15 2010] [error] [client 71.248.161.106] ModSecurity: Access denied with code 500 (phase 2). Pattern match "(insert[[:space:]]+into.+values|select.*from.+[a-z|A-Z|0-9]|select.+from|bulk[[:space:]]+insert|union.+select|convert.+\\(.*from)" at ARGS:wpTextbox1. [file "/usr/local/apache/conf/modsec2.user.conf"] [line "355"] [id "300016"] [rev "2"] [msg "Generic SQL injection protection"] [severity "CRITICAL"] [hostname "www.xsquawkbox.net"] [uri "/xpsdk/mediawiki/index.php"] [unique_id "TNra30PhuxAAAE8SUkMAAAAQ"]&lt;/blockquote&gt;Whoa.  What is that?  The server has &lt;a href="http://www.modsecurity.org/"&gt;ModSecurity&lt;/a&gt; installed, including a bunch of rules (as defined by regular expressions) designed to reject, um, bad stuff.  
The rule seems to come from &lt;a href="http://www.gotroot.com/downloads/ftp/mod_security/all-rules.conf"&gt;here&lt;/a&gt; and MediaWiki isn't the only program that it can hose.&lt;br /&gt;&lt;br /&gt;If you pull apart the regular expression, you can see how things go wrong.  Loosely speaking the rule matches text in this form:&lt;br /&gt;&lt;blockquote&gt;insert ___ into ___ values|select ___ from ___ from ___ insert|union ___ select|convert &lt;/blockquote&gt;where the blanks can be anything, and a pipe indicates that either word is acceptable.  So...insert, into, values, from, from, insert, convert.  Those words appear in that sequence of comments in my OpenAL sample.  And frankly, it's not a very remarkable sequence, hence it matching &lt;a href="https://www.modsecurity.org/tracker/browse/CORERULES-16"&gt;this&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;So I thought the problem was long posts, but it wasn't: the longer the post, the more likely that a matching sequence of words would show up.&lt;br /&gt;&lt;br /&gt;From what I can tell, white-listing URLs from the rule is the "standard" fix.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-3704308050360543926?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/3704308050360543926/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/mediawiki-and-modsecurity.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3704308050360543926'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3704308050360543926'/><link rel='alternate' type='text/html' 
href='http://hacksoflife.blogspot.com/2010/11/mediawiki-and-modsecurity.html' title='MediaWiki and ModSecurity'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-110659422785274908</id><published>2010-11-08T20:34:00.003-05:00</published><updated>2010-11-08T21:00:29.180-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Macintosh'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenAL'/><category scheme='http://www.blogger.com/atom/ns#' term='Linux'/><category scheme='http://www.blogger.com/atom/ns#' term='Windows'/><title type='text'>OpenAL on Three Platforms</title><content type='html'>&lt;a href="http://climpxwss01.creativelabs.com/openal/default.aspx"&gt;OpenAL&lt;/a&gt; is a cross-platform 3-d sound API.  It is not my favorite sound API, but it is cross-platform, which is pretty handy if you work on a cross-platform game.  Keeping client code cross-platform with OpenAL is trivial (as long as you don't depend on particular vendor extensions) but actually getting an OpenAL runtime is a little bit trickier.  This post describes a way to get OpenAL on three platforms without too much user hassle.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;OS X&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;On OS X things are pretty easy: OpenAL ships with OS X as a framework dating back to, well, long enough that you don't have to worry about it.  Link against the framework and go home happy.&lt;br /&gt;&lt;br /&gt;Well, maybe you do have to worry.  OpenAL started shipping with OS X as of 10.4.  
If you need to support 10.3.9, weak link against the framework and check an OpenAL symbol against null to handle a missing installation at run-time.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Linux&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;On Linux, OpenAL is typically in a shared object, e.g. libopenal.so.  The problem is that the major version number of the .so changed from 0 to 1 when the reference implementation was replaced with OpenAL Soft.  Since we were linking against libopenal.so.0, this broke X-Plane.&lt;br /&gt;&lt;br /&gt;My first approach was to yell and complain in the general direction of the Linux community, but this didn't actually fix the problem.  Instead, I wrote a wrapper around OpenAL, so that we could resolve function pointers from libraries opened with dlopen.  Then I set X-Plane up to first try libopenal.so.1 and then libopenal.so.0.&lt;br /&gt;&lt;br /&gt;(Why did the .so number change?  The argument is that since the .0 version contained undocumented functions, technically the removal of those undocumented functions represents an ABI breakage.  I don't quite buy this, as it punishes apps that follow the OpenAL spec to protect apps that didn't play by the rules.  But again, my complaining has not changed the .so naming conventions.)&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Windows&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The Windows world is a bit more complicated because there are two separate things to consider:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The implementation of OpenAL (e.g. who provided openal32.dll).&lt;/li&gt;&lt;li&gt;The renderer (e.g. which code is actually producing audio).&lt;/li&gt;&lt;/ul&gt;Basically Creative Labs wanted to create an ecosystem like OpenGL where users would have drivers installed on their machine matching specialized hardware.  So the Creative implementation of openal32.dll searches for one or more renderers and can pass through OpenAL commands to any one of them.  
The standard OpenAL "redistributable" that Creative provides contains both this wrapper and a software-only renderer on top of DirectSound (the "generic software" renderer).&lt;br /&gt;&lt;br /&gt;OpenAL Soft makes things interesting: you can install OpenAL Soft into your system folder and it becomes yet another renderer.  Or you can use it instead of any of the Creative components and then you get OpenAL Soft and no possible extra renderers.&lt;br /&gt;&lt;br /&gt;Now there's one other issue: what if there is no OpenAL runtime on the user's machine?  DirectSound is pretty widely available, but OpenAL is not.&lt;br /&gt;&lt;br /&gt;Here we take advantage of our DLL wrapper from the Linux case above: we package OpenAL Soft with the app as a DLL (it's LGPL).  We first try to open openal32.dll in the system folder (the official way), but if that fails, we fall back and open our own copy of OpenAL Soft.  Now we have sound everywhere and hardware acceleration if it's available.&lt;br /&gt;&lt;br /&gt;One final note: in order to safely support third-party Windows renderers like Rapture3D, we need to give the user a way to pick among multiple devices, rather than always opening the default device (which is standard practice on Mac/Linux).  
This can be done with some settings UI or some heuristic to pick renderers.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-110659422785274908?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/110659422785274908/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/openal-on-three-platforms.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/110659422785274908'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/110659422785274908'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/11/openal-on-three-platforms.html' title='OpenAL on Three Platforms'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-3928509137435325724</id><published>2010-10-08T13:15:00.002-04:00</published><updated>2010-10-08T13:27:07.149-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>Why GPU Sliced Shadows Fail For Clouds</title><content type='html'>I have discovered through experimentation that NVidia's technique for self-shadowing particle volumes (found &lt;a href="http://developer.download.nvidia.com/compute/cuda/sdk/website/C/src/smokeParticles/doc/smokeParticles.pdf"&gt;here&lt;/a&gt;) doesn't work well for a flight simulator cloud system.  
When reading a white paper, it can be hard to judge the appropriateness of an algorithm for a particular application; here's what went wrong in our case.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The Basic Algorithm&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The basic algorithm is something like this:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Sort the particles to be directional for &lt;span style="font-style: italic;"&gt;both&lt;/span&gt; the light source and the viewer.  (This can require rendering front-to-back to the viewer at times.)&lt;/li&gt;&lt;li&gt;Along this direction, slice the particles up.  For each slice, plot first, then update our shadows.&lt;/li&gt;&lt;li&gt;Composite the finished system to screen (necessary if we are going front-to-back).&lt;/li&gt;&lt;/ol&gt;The algorithm produces nice soft self-shadowing because the shadow texture is being incrementally updated as we move through the slices.&lt;br /&gt;&lt;br /&gt;The algorithm does work well; for a test case with a cloud built to meet the algorithm's requirements, the shadows were soft, real-time, and quite plausible.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Performance Bottlenecks&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The algorithm has two basic performance bottlenecks:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Like all over-drawn particle system algorithms, it is fill rate limited if we overlap too many particles.&lt;/li&gt;&lt;li&gt;Slicing requires finishing rasterization to a texture and then using the texture, so the algorithm is bound by the number of slices.  
(The slicing can affect both time spent in the driver rebuilding the pipeline, including costs of changing the render target, and it can stall depending on how smart your driver is about requiring pending rasterization to complete.)&lt;/li&gt;&lt;/ul&gt;The paper points both of these out and notes that the number of slices may have to be traded for performance.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Overdraw and Alpha&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The algorithm is a little bit mismatched to a flight simulator cloud system because a flight simulator cloud system typically uses a smaller number of more opaque cloud particles to avoid fill-rate issues.  This causes problems because the algorithm doesn't naturally diminish self-shadowing; it depends on the fact that we haven't accumulated a large number of particles to keep shadows very light when two particles are near each other.&lt;br /&gt;&lt;br /&gt;So the first problem in general use is that the quality of the shadows fights with the optimization of relatively opaque particles.  As soon as we make fewer, smaller, more opaque particles (which can be coped with via texturing) the quality of the shadows becomes quite poor.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Slicing and Bucketing&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The second problem is that for a general large-scale particle field we need some kind of bucketing, and this fights with slicing.  
We want to break our particles into a bucket grid for two reasons:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;It gives us a way to rapidly cull a lot of particles.&lt;/li&gt;&lt;li&gt;The bucket grid has a traversal order that is back to front, so we only need to Z-sort within a bucket, saving a lot of sorting time.&lt;/li&gt;&lt;/ul&gt;The problem is this: we don't know the relationship spatially between slices of different buckets, so we need to slice &lt;span style="font-style: italic;"&gt;within&lt;/span&gt; a bucket, but do this for &lt;span style="font-style: italic;"&gt;each&lt;/span&gt; bucket on screen.  So if we have 12 buckets on screen, we have 12x the number of slices.&lt;br /&gt;&lt;br /&gt;Slices are really quite expensive due to the GPU setup overhead, and even a small number of buckets means that we can't afford enough slices.  NVidia recommends 32-128 slices, but with buckets, you'll be lucky to get 8 slices per bucket.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Low Slice Count = Ugly&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;It goes without saying that having a small number of slices is going to produce less correct shadows.  But there is another, more serious problem: as you rotate the camera, the slicing plane changes.  Nearby particles that are in the same plane will not shadow each other, but when/how this happens is a function of how wide the slicing plane is and which way it goes.&lt;br /&gt;&lt;br /&gt;What this means is: as we rotate the camera, some particles will suddenly stop shadowing each other as the slicing planes rotate, causing noticeable popping artifacts.&lt;br /&gt;&lt;br /&gt;The really bad artifact comes when we go from having the sun slightly facing to us to slightly facing away from us.  At that point the algorithm will switch between back-to-front and front-to-back rendering, and the slicing plane will jump by 90 degrees almost instantly.  
This produces a huge number of artifacts when the number of slices is small.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Summary&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The algorithm fails when:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;We have mostly opaque particles and&lt;/li&gt;&lt;li&gt;We can't afford enough slices and&lt;/li&gt;&lt;li&gt;There are external constraints (like culling) artificially "wasting" slices.&lt;/li&gt;&lt;/ul&gt;Unfortunately, that is us...so...on to other techniques.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-3928509137435325724?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/3928509137435325724/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/10/why-gpu-sliced-shadows-fail-for-clouds.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3928509137435325724'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3928509137435325724'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/10/why-gpu-sliced-shadows-fail-for-clouds.html' title='Why GPU Sliced Shadows Fail For Clouds'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-7999159379073609450</id><published>2010-10-07T10:19:00.001-04:00</published><updated>2010-10-07T10:19:00.162-04:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>Alpha Blending, Lets Try Again</title><content type='html'>A while ago I posted &lt;a href="http://hacksoflife.blogspot.com/2010/02/alpha-blending-back-to-front-front-to.html"&gt;this convoluted mess&lt;/a&gt; of recipes for blending back to front and front to back.  I've had some time to revisit the code, and the actual formulas are simpler than I realized and more consistent; they also don't require split blending functions for the back to front composited case, which is nice if you want to run on, well, on dinosaur hardware (since pretty much anything you can find has split blending functions from the Radeon 8500 on).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Premultiplied Alpha&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The goal is to composite several translucent textures together, and then composite them over our scene as if the whole scene had been drawn in order.  In order to make this work, we want to use premultiplied alpha - that is, textures where the RGB color has already been made 'darker' if the alpha channel is not 1.0.  In this scheme our blend function can be (1.0, 1.0 - SA) instead of the normal (SA, 1.0-SA) because the source pixel is already multiplied by SA.  That would be the premultiplication.&lt;br /&gt;&lt;br /&gt;Why is &lt;a href="http://hacksoflife.blogspot.com/2010/10/premultiplication-pros-and-cons.html"&gt;premultiplication a good idea&lt;/a&gt;?  We have to solve the problem of "what is under translucent", and premultiplication does that.  In a premultiplied texture, the RGB channel becomes more black as it becomes more transparent, and thus "nothing" has a valid color representation (black).  In a traditional texture, there &lt;span style="font-style: italic;"&gt;is&lt;/span&gt; color behind transparent, and that can cause sampling artifacts.&lt;br /&gt;&lt;br /&gt;So our goal is to composite a premultiplied texture.  
That means that the "clear" will be 0,0,0,0 (black, transparent).  Note that while the color is black (meaning nothing to add color-wise) we still need that alpha channel to be 0 (transparent) too to tell us that the background won't be occluded.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Fixing Back to Front&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If you have ever blended together a bunch of geometry (back to front) and then composited the result on top of something else, you know that the alpha channel for that back-to-front geometry is going to be pretty screwed up.  To see the problem, imagine blending a really light (10% alpha) screen over an already opaque scene.  That light screen will (by a "strength" of 10%) move the alpha channel away from opacity and toward translucency.  The problem is that the alpha blends itself, and we don't want that.&lt;br /&gt;&lt;br /&gt;It turns out that pre-multiplied alpha can fix this.  We set our blending equation to (1.0, 1.0-SA) and we pre-multiply our RGB.  Our alpha will now be the old alpha (lightened by the amount the new alpha is "covering it") plus the new alpha, but not lightened.&lt;br /&gt;&lt;br /&gt;To take the case of a 10% screen over an opaque scene, the alpha will be 0.1 * 1.0 + 1.0 * (1.0 - 0.1), which gives us...1.0, which is exactly right: blending over an opaque object doesn't make it translucent.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Front to Back&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For the front to back case, we still want to use pre-multiplied alpha, but we set our blend factors to (1.0-DA, 1.0).  With the back to front case in "pre-multiplied" form, this should look very symmetric.  
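&lt;br /&gt;&lt;br /&gt;As a rough illustration (this is not X-Plane's code), the two recipes can be checked on the CPU. The sketch below assumes RGBA tuples with premultiplied color in the 0.0 to 1.0 range; the helper names blend_back_to_front and blend_front_to_back are made up for this example and simply apply the (1.0, 1.0-SA) and (1.0-DA, 1.0) factors to all four channels.&lt;br /&gt;&lt;pre&gt;
```python
import math

CLEAR = (0.0, 0.0, 0.0, 0.0)  # premultiplied "nothing": black, transparent


def blend_back_to_front(src, dst):
    """Blend factors (1.0, 1.0 - SA): src is drawn on top of dst."""
    sa = src[3]
    return tuple(s + d * (1.0 - sa) for s, d in zip(src, dst))


def blend_front_to_back(src, dst):
    """Blend factors (1.0 - DA, 1.0): src is drawn behind dst."""
    da = dst[3]
    return tuple(s * (1.0 - da) + d for s, d in zip(src, dst))


def premultiply(r, g, b, a):
    """Convert straight-alpha RGBA to premultiplied RGBA."""
    return (r * a, g * a, b * a, a)


# Three hypothetical layers, listed from farthest to nearest.
layers = [
    premultiply(0.0, 0.0, 1.0, 1.0),  # opaque blue, farthest
    premultiply(0.0, 1.0, 0.0, 0.5),  # half-transparent green
    premultiply(1.0, 0.0, 0.0, 0.1),  # 10% red screen, nearest
]

# Draw far to near with the back-to-front factors.
back_to_front = CLEAR
for layer in layers:
    back_to_front = blend_back_to_front(layer, back_to_front)

# Draw near to far with the front-to-back factors.
front_to_back = CLEAR
for layer in reversed(layers):
    front_to_back = blend_front_to_back(layer, front_to_back)

same = all(math.isclose(a, b) for a, b in zip(back_to_front, front_to_back))
print(back_to_front, front_to_back, same)
```
&lt;/pre&gt;Both orders should land on the same premultiplied pixel (up to floating point rounding), and the 10% screen drawn over the opaque layer leaves the alpha at 1.0.&lt;br /&gt;&lt;br /&gt;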
In fact, all we're doing is changing which one is the "master" (whose alpha cuts down the other) and which is not.&lt;br /&gt;&lt;br /&gt;What effectively happens is:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The less alpha is in the buffer already, the more you get to draw (hence 1.0-DA as a factor).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The buffer is never reduced in color (which makes sense, since you can't darken something by drawing behind it).&lt;/li&gt;&lt;li&gt;The amount of alpha opacity you leave behind/add-in is also reduced by what is already there (you matter less if you are behind something translucent).&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-7999159379073609450?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/7999159379073609450/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/10/alpha-blending-lets-try-again.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7999159379073609450'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7999159379073609450'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/10/alpha-blending-lets-try-again.html' title='Alpha Blending, Lets Try Again'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-5299129231646043432</id><published>2010-10-06T10:45:00.001-04:00</published><updated>2010-10-06T10:45:44.528-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>Premultiplication: Pros and Cons</title><content type='html'>I realized today that premultiplied alpha could fix a nasty artifact that we sometimes get in X-Plane: "tree ring".*&lt;br /&gt;&lt;br /&gt;The bug is this: imagine you have two texels in your texture.  The left  one is transparent, and the right one is opaque green (a tree).  What is  the RGB "behind" the transparent one?  Let's call it junk.&lt;br /&gt;&lt;br /&gt;When this texture is sampled with linear filtering, the graphics card  will do the wrong thing: it will blend the two texels by channel to come  up with a texel sample that is a mix of green + junk in the RGB channel  and a translucent alpha channel.  Thus at the edges of our  alpha-blended tree, we will see a 'ring' of junk leaking into the  texture.&lt;br /&gt;&lt;br /&gt;The traditional work-around (and the one we use for X-Plane) is to  ensure that the RGB behind the transparent parts of the texture contains  something valid that we wouldn't mind seeing, e.g. "more green".  This  is not an ideal work-around because Photoshop will put white in this  space when alpha reaches 0%, so most artists will have to manually fix  this problem over and over (and it's not an easy problem to see since  the erroneous color is behind a 0% alpha pixel).&lt;br /&gt;&lt;br /&gt;If we used pre-multiplied alpha, this would not be a problem.  With  premultiplied alpha, the RGB pixels are already multiplied by the alpha  channel; thus the transparent pixel is by definition black (0% alpha *  any RGB = 0,0,0 = black).  
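&lt;br /&gt;&lt;br /&gt;To put numbers on this, here is a small sketch of a linear filter sampling halfway between the two texels. It is only an illustration: the texel values, the white "junk" color, and the helper names lerp_texels and premultiply are assumptions made up for the example, not anything from X-Plane.&lt;br /&gt;&lt;pre&gt;
```python
def lerp_texels(t0, t1, f):
    """Linear filtering: per-channel mix of two RGBA texels."""
    return tuple(a * (1.0 - f) + b * f for a, b in zip(t0, t1))


def premultiply(texel):
    """Convert a straight-alpha RGBA texel to premultiplied RGBA."""
    r, g, b, a = texel
    return (r * a, g * a, b * a, a)


# Straight-alpha texels: transparent with white "junk" behind it, and opaque green.
junk = (1.0, 1.0, 1.0, 0.0)
tree = (0.0, 1.0, 0.0, 1.0)

# Straight alpha: filter first, then the blender multiplies RGB by alpha.
straight = lerp_texels(junk, tree, 0.5)
straight_contribution = tuple(c * straight[3] for c in straight[:3])

# Premultiplied: the multiply happens before filtering; the blender adds RGB as-is.
premultiplied = lerp_texels(premultiply(junk), premultiply(tree), 0.5)
premultiplied_contribution = premultiplied[:3]

print(straight_contribution)       # (0.25, 0.5, 0.25): white junk leaks in
print(premultiplied_contribution)  # (0.0, 0.5, 0.0): pure green at half strength
```
&lt;/pre&gt;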
Thus when we blend green and black we get  "darker green", which is the appropriate pre-multiplied color for a  linear sampling at the edge of our tree.  Simply put, premultiplying  puts the alpha multiply before linear interpolation, which is what we  want.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Compression?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I can think of a possible reason to not use pre-multiplied alpha in  production art assets: texture compression.  If I have a solid green  tree with an alpha channel, my texture compressor uses all of its "color  bits" to get that green color right.  But if I premultiply, those color  bits are now storing both the color and the effect of alpha (the  darkening).  I may get some color distortion on my tree because the  compressor is trying to get the pre-multiplied alpha right.&lt;br /&gt;&lt;br /&gt;In other words, a non-premultiplied texture may compress better.   Ideally I'd like my compressor to be alpha-aware, that is, optimize the  color under the opaque part at the expense of what is under the  transparent part.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The Rest Of the Story&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Obviously we're not going to change X-Plane to premultiplication given  so many art assets out there.  But there is more to the story too.&lt;br /&gt;&lt;br /&gt;The * up there is that there is a second, significantly worse cause of  "rings" on trees: z-buffer artifacts.  The z-buffer doesn't handle  translucency very well (and by that I mean it doesn't handle it at  all).  If our trees contain translucent edges due to linear filtering,  we get Z put down over the translucent parts, and that cuts out any 3-d  building or additional trees behind them.  
The result is "blue rings"  where the sky shows through what should be a forest.&lt;br /&gt;&lt;br /&gt;The solution  is the one we use in practice: we turn off blending  entirely and simply test the texels - they are in or out.  We still use  linear filtering though, so that the alpha edge of our tree isn't square  and jagged, so we would see a ring if we have bogus color underneath  the transparent parts of the trees.  Since in practice we almost always  ship DXT compressed textures, the compression argument against  pre-multiplication holds.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-5299129231646043432?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/5299129231646043432/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/10/premultiplication-pros-and-cons.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5299129231646043432'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5299129231646043432'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/10/premultiplication-pros-and-cons.html' title='Premultiplication: Pros and Cons'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-8205288554951081623</id><published>2010-09-27T14:01:00.003-04:00</published><updated>2010-09-27T14:29:39.637-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='c++'/><category scheme='http://www.blogger.com/atom/ns#' term='Computational Geometry'/><title type='text'>When Good Floating Point Goes Bad?</title><content type='html'>We have a handful of linear algebra/geometry routines in X-Plane that simplify writing geometric tests.  In their heart, they almost all turn into dot products.  So: when can we not trust them?&lt;br /&gt;&lt;ul&gt;&lt;li&gt;For real floating point data, dot products may not go to zero; the zero dot product is at the heart of "is this a left or right turn" and "which side of a line am I on".   Rounding errors may force points to fall to one side of a line or another.&lt;/li&gt;&lt;li&gt;Even more weirdly, there's no guarantee that points will &lt;span style="font-style: italic;"&gt;consistently&lt;/span&gt; fall on one side of a line or another; the rounding errors need to be treated as effectively random.  A point that is moving across a line may give 'jittery' results as the point slowly crosses the line.&lt;/li&gt;&lt;li&gt;Order matters.  Given the same theoretical line, defining it by swapping the input points (e.g. line AB instead of BA) may have unpredictable effects on side-of tests for very small distances.&lt;/li&gt;&lt;li&gt;Finally, there is no function more hellish than intersection.  The more parallel the lines, the more completely insane the intersection results become.  Serious paranoia is advised when dealing with intersections, because the routines can give you a positive result ("yes there was an intersection") with lunatic results ("and the intersection is on Mars").  
I usually cope with this by using a dot product test to case out the near-collinear case, which is then handled in a different algorithm that doesn't require clean intersections.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-8205288554951081623?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/8205288554951081623/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/09/when-good-floating-point-goes-bad.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8205288554951081623'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8205288554951081623'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/09/when-good-floating-point-goes-bad.html' title='When Good Floating Point Goes Bad?'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-8106076968794521246</id><published>2010-09-03T08:49:00.004-04:00</published><updated>2010-09-03T09:04:50.578-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenAL'/><category scheme='http://www.blogger.com/atom/ns#' term='Linux'/><title type='text'>OpenAL on Linux, Part 27</title><content type='html'>This &lt;a href="https://bugs.launchpad.net/bugs/273558"&gt;bug&lt;/a&gt; has affected X-Plane 8 and 9; we were able to 
recut 9 to work around it, but X-Plane 8 is a closed product.  Here's the short story:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A while ago intrepid developers replaced the implementation of libopenal on Linux with a new complete rewrite.&lt;/li&gt;&lt;li&gt;When they did so, they raised the major version number.&lt;/li&gt;&lt;/ul&gt;Huh???  This caused naive application developers like me to say things like "what the hell are you guys doing?  The whole point of dynamic linking is that you can replace implementations without breaking my app.  So why did you break my app?"&lt;br /&gt;&lt;br /&gt;The change in major version breaks the link to X-Plane, and would be appropriate if the library wasn't compatible.&lt;br /&gt;&lt;br /&gt;Yesterday someone finally posted a list of dropped ABI symbols in the new OpenAL implementation. They are all extension symbols except for alBufferAppendData. So I can't deny that symbols are dropped and that is an ABI breakage.  The question is: should the soname be revised?&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Extensions&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Most of the symbols missing are _LOKI.  
For those not familiar with OpenGL extensions (from which the OpenAL extension concept is stolen^H^H^H^H^H^Hborrowed) the idea is this: an app initializes the library, queries some kind of string to see what additional non-core features the library supports, and then resolves function pointers at run-time, using function pointers only once the extension string is present.&lt;br /&gt;&lt;br /&gt;Therefore it's really important that the major version of the shared object &lt;span style="font-style: italic;"&gt;not&lt;/span&gt; change when an extension is removed; the extension is not part of the ABI, applications should not (and cannot) depend on it being present at link-time, and an extension may not function without specialized hardware.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;alBufferAppendData&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There is one mystery symbol: alBufferAppendData, which is present without a decoration.  From what I can tell from the annotated OpenAL 1.0 specification, "append-data" was a proposed streaming scheme that was eventually moved to an extension when it was dropped from the core.  It's not in the 1.0 spec and it's not in the 1.1 spec.&lt;br /&gt;&lt;br /&gt;So this strikes me as a bug in the implementation of the original library: it exports a symbol that shouldn't be there.  Does it make sense to raise the major version of the .so because the symbol has been dropped?  I don't think so, but I can see how you could argue it both ways.&lt;br /&gt;&lt;br /&gt;The argument for dropping it is this: if the major version is changed, then the old and new OpenAL implementations can live side by side, and all applications are happy.  
Since alBufferAppendData is not trivial functionality, this would be better than expecting the new implementation to support alBufferAppendData for historic reasons.&lt;br /&gt;&lt;br /&gt;But this is not at all what is happening; instead distributions are purging libopenal.so.0 (the old implementation) when they bring in the new one, and then asking applications to recompile themselves.&lt;br /&gt;&lt;br /&gt;In other words, because some number of applications may be using a function that is not in any OpenAL specification but is in the old implementation, they have renamed the shared library, forcing everyone to recompile.  Put differently, they have replaced the inconvenience of having some games be broken with the inconvenience of having all games be broken.&lt;br /&gt;&lt;br /&gt;(In X-Plane we work around this by simply dlopening either libopenal.so.0 or libopenal.so.1, whichever one we can find.  Since both implement the core spec symbols, this works fine.)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-8106076968794521246?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/8106076968794521246/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/09/openal-on-linux-part-27.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8106076968794521246'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8106076968794521246'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/09/openal-on-linux-part-27.html' title='OpenAL on Linux, Part 27'/><author><name>Benjamin 
Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-3942825613485881570</id><published>2010-09-01T02:40:00.002-04:00</published><updated>2010-09-01T02:44:17.438-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='c++'/><category scheme='http://www.blogger.com/atom/ns#' term='Threading'/><category scheme='http://www.blogger.com/atom/ns#' term='Debugging'/><title type='text'>I Just Saw a Race Condition</title><content type='html'>I was debugging a threading bug and something truly bizarre happened to me: I was printing variables and when I went back to re-print a variable, it had changed on me!  This was without actually ever running the program.&lt;br /&gt;&lt;br /&gt;As far as I can tell, this is what happened:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The variable I was printing was subject to writes in a race condition - some other thread was splatting it.&lt;/li&gt;&lt;li&gt;After printing the variable, I printed some pieces of an STL container, which had to execute code in the attached process, which temporarily released all threads.&lt;/li&gt;&lt;li&gt;Thus when I turn around, the program has been running.&lt;/li&gt;&lt;/ol&gt;In some ways, it was a really lucky find...after scratching my head and going "it's 2:30 AM and I've been drinking...that didn't just happen" I realized that the variable in question was being passed into the function by reference and thus might be secretly global (which was of course the real bug).  
Had I not seen the reference var get splatted under my nose it would have taken a lot more printing to find the problem.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-3942825613485881570?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/3942825613485881570/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/09/i-just-saw-race-condition.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3942825613485881570'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3942825613485881570'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/09/i-just-saw-race-condition.html' title='I Just Saw a Race Condition'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-9109900524728997741</id><published>2010-08-11T15:41:00.002-04:00</published><updated>2010-08-11T15:56:45.620-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><category scheme='http://www.blogger.com/atom/ns#' term='Performance'/><title type='text'>When Is Your VBO Double Buffered?</title><content type='html'>A while ago I finally wrapped my head around this, and wrote a &lt;a href="http://hacksoflife.blogspot.com/2010/02/double-buffering-vbos.html"&gt;three&lt;/a&gt; &lt;a 
href="http://hacksoflife.blogspot.com/2010/02/double-buffering-part-2-why-agp-might.html"&gt;part&lt;/a&gt; &lt;a href="http://hacksoflife.blogspot.com/2010/02/one-more-on-vbos-glbuffersubdata.html"&gt;post&lt;/a&gt; trying to explain why you never get double-buffered behavior from a VBO unless you orphan it.  This is going to be an attempt to explain the issues more succinctly and describe how to stream data through a VBO.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;The Problem&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The problem that OpenGL developers (myself included) crash into is a stall in the OpenGL pipeline when trying to specify vertex data that changes every frame.  You go preparing new VBOs of meshes (for a particle system for example), and when you go into your favorite adaptive sampling profiler, you find that you're blocking in one of glMapBuffer or glBufferSubData.&lt;br /&gt;&lt;br /&gt;The problem is that the GPU has a "lock" on your VBO until it finishes drawing from it, preventing you from changing the VBO's contents.  You can't put the new mesh in there until the GPU is done with the old one.&lt;br /&gt;&lt;br /&gt;To understand why this happens, it helps to play "what if I had to write the GL driver myself" and look at what a total pain in the ass it would be to fix this at the driver level.  In particular, even if you did the work, your GL driver might be slower in the general case because of the overhead to be clever in this case.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Your VBO Isn't Really Double-Buffered&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Sometimes VBOs are copied from system memory to VRAM to be used.  We might naively think that if this were the case, then we could gain access to the original system copy to update it while the GPU uses the VRAM copy.&lt;br /&gt;&lt;br /&gt;In practice, this would be insanely hard to implement.  
First, this scheme would only work when VBOs are being shadowed in VRAM (not the case a lot of the time) and when the VBO has already been copied to VRAM by the time we need to respecify its contents.&lt;br /&gt;&lt;br /&gt;If we haven't copied the VBO to VRAM, we'd have to stop and block application code while we DMA the VBO into VRAM (assuming the DMA engine isn't busy doing something else).  If DMA operations on the GPU have to be serialized into the general command queue, that means the DMA operation isn't going to happen for a while.&lt;br /&gt;&lt;br /&gt;If that hasn't already convinced you that treating VRAM vs. main memory like a double buffer makes no sense, consider also that if main memory is to be released, the VRAM copy is no longer a cached shadow, it is now the only copy!  We now have to mark this block as "do not purge".  So we might be putting more pressure on VRAM by relying on it as a double buffer.&lt;br /&gt;&lt;br /&gt;I won't even try to understand the complexity that a pending glReadPixels into the VBO would have.  It should be clear at this point that even if your VBO seems double buffered by VRAM, for the purpose of streaming data, it's not.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Your VBO Isn't Made Up of Regions&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;You might not be using all of your VBO; you might draw from one half and update the other. glBufferSubData won't figure that out.  In order for it to do so, it would have to:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Know the range of the VBO used by all pending GPU operations.  (This is in theory possible with glDrawRangeElements, but not the older glDraw calls.)&lt;/li&gt;&lt;li&gt;Track the time stamp of each individual range to see how long we have to block for.&lt;/li&gt;&lt;/ul&gt;The GPU's lock on our VBO has now changed from an integer time stamp to some kind of diverse region of time stamps with set operations.  It's not surprising that the drivers don't do this.   
If you have a pending operation on any part of your VBO, glBufferSubData will block.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;One Way To Get a Double Buffer&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The one way to get a double buffer on older OpenGL implementations is to re-specify the data with glBufferData and a NULL pointer.  Most drivers will recognize that in "throwing out" the contents of your buffer, you are separating the contents of the buffer for future ops from what is already in the queue for drawing.  The driver can then allocate a second master block of memory and return that at the next glMapBuffer call.  The driver will throw out your original block of memory later at an unspecified time once the GPU is done with it.&lt;br /&gt;&lt;br /&gt;Alternatively, if you are on OS X or have a GL 3.0 driver, there are extensions that let you check out and operate on a buffer with locking suspended, allowing you to manage sub-regions in your buffer independently.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-9109900524728997741?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/9109900524728997741/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/08/when-is-your-vbo-double-buffered.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/9109900524728997741'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/9109900524728997741'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/08/when-is-your-vbo-double-buffered.html' title='When Is Your VBO Double Buffered?'/><author><name>Benjamin 
Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-6152434906203896349</id><published>2010-08-06T20:22:00.002-04:00</published><updated>2010-08-06T20:28:31.779-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Macintosh'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>Restarting the OS X Window Server for Fun and Profit</title><content type='html'>Well, it's not very profitable.  Hell, it's not even that fun.  But let's just say, hypothetically, that you were working on a flight simulator with an OpenGL rendering engine. And let's just say, to make this interesting, that if you crank up all of the new rendering engine options, sometimes it causes the OpenGL stack to completely lose its meatballs, and the resulting carnage renders the entire computer unusable.&lt;br /&gt;&lt;br /&gt;(If you are having trouble imagining this, close your eyes and visualize a desktop where nothing but the mouse moves, but as you drag what were your windows, small pieces of your scene graph flicker in and out of what used to be your open windows, as if you were just showing random parts of video memory.  Okay, maybe it is a little bit fun.)&lt;br /&gt;&lt;br /&gt;Here's what you need to get your life back:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Have remote ssh enabled in the sharing control panel.  ssh into your machine.  Odds are, the remote shell is perfectly happy, even if the desktop looks like you hired Picasso as your art lead and he was extra high that day.&lt;/li&gt;&lt;li&gt;Kill -9 pid will bring back the desktop some of the time.  That is, sometimes just killing off your app is enough to get your desktop back.  
Typically this is a win in the case where the driver is constantly resetting and you just can't use the UI because the reset cycle is slow.&lt;/li&gt;&lt;li&gt;If that doesn't work, this will kill off the entire window manager (including, um, everything...the Finder, your app, X-Code, icanhazcheesburger):&lt;code&gt; sudo killall -HUP WindowServer&lt;/code&gt;&lt;/li&gt;&lt;/ol&gt;It beats a full reboot (by some marginal amount).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-6152434906203896349?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/6152434906203896349/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/08/restarting-os-x-window-server-for-fun.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/6152434906203896349'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/6152434906203896349'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/08/restarting-os-x-window-server-for-fun.html' title='Restarting the OS X Window Server for Fun and Profit'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-6387368907555514792</id><published>2010-08-06T13:51:00.002-04:00</published><updated>2010-08-06T14:32:36.157-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='c++'/><category 
scheme='http://www.blogger.com/atom/ns#' term='Threading'/><category scheme='http://www.blogger.com/atom/ns#' term='Quotes'/><title type='text'>A Healthy Fear of Threading</title><content type='html'>Continuing in the line of pithy quotes:&lt;br /&gt;&lt;blockquote&gt;There are only two kinds of programmers: programmers with a healthy fear of threaded code and programmers who &lt;span style="font-style: italic;"&gt;should&lt;/span&gt; fear code.&lt;br /&gt;&lt;/blockquote&gt;Now I'm not saying "never thread".  I'm just saying "you better be getting something good for that threading, because it's driving up your development costs."&lt;br /&gt;&lt;br /&gt;In particular, the effective execution order of threaded code can change with every run, and there is no guarantee that you have seen every combination of execution order by running your program a finite number of times.&lt;br /&gt;&lt;br /&gt;Thus methods of checking your code quality by running your program (perhaps many times) won't detect bugs in threaded code.  You may not find out until that user with one more core and a background program chewing up cycles hits an execution order that you haven't seen yet.&lt;br /&gt;&lt;br /&gt;Instead for threaded code you have to prove logically that the execution order constraints applied (via locking, etc.) create a bounded set of execution combinations, and that each one is correct.  This isn't quick or easy to do.&lt;br /&gt;&lt;br /&gt;One way we cope with this development cost in X-Plane (where we need to use threads to fully utilize multiple cores) is to use threading design patterns with known execution limits.  The most common one is a message queue, where ownership of data access flows with the message down a queue.  This idiom not only guarantees serialized access to data without locks, but the implementation in C++ tends to make errors rare; if you have the message you have the pointer, and thus you have rights on the data.  
If you don't have the message, you have nothing to dereference.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-6387368907555514792?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/6387368907555514792/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/08/healthy-fear-of-threading.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/6387368907555514792'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/6387368907555514792'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/08/healthy-fear-of-threading.html' title='A Healthy Fear of Threading'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-8291078353750587527</id><published>2010-07-18T12:20:00.003-04:00</published><updated>2010-07-18T12:22:25.025-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>How Does OpenGL Work?</title><content type='html'>If you are registered with Apple's developer sites, I strongly recommend the OpenGL and OpenGL ES video talks from WWDC 2010.  
The Apple engineers spell out in a fair amount of detail things that you had to infer previously, including:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;How state information is accumulated and then resynchronized at draw call time.&lt;/li&gt;&lt;li&gt;How resources are synchronized and shared between host memory and the GPU.&lt;/li&gt;&lt;/ul&gt;The videos are in QuickTime format with subtitles, so you can play them back at 2x speed with captioning to get through the material faster.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-8291078353750587527?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/8291078353750587527/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/07/how-does-opengl-work.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8291078353750587527'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8291078353750587527'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/07/how-does-opengl-work.html' title='How Does OpenGL Work?'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-2657026844241838479</id><published>2010-06-03T23:09:00.003-04:00</published><updated>2010-06-04T11:33:00.618-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='c++'/><category 
scheme='http://www.blogger.com/atom/ns#' term='algorithms'/><category scheme='http://www.blogger.com/atom/ns#' term='STL'/><title type='text'>Interval Sets With the STL</title><content type='html'>I spent some time today working on an interval set.  The basic idea of an interval set is to record a set of disjoint ranges that partition a number space into a finite "included" area and an infinite "excluded" area.  Or to put it more simply, [3, 6) is an interval, and [3, 6) [8, 10) is an interval set.&lt;br /&gt;&lt;br /&gt;Googling around for this I found a few ideas based on using a sorted map, with the interval beginning as the key and the interval end as the value. My approach is different, and is closer to the original implementation of Macintosh regions: a vector of beginning and ending pairs, e.g. { 3, 6, 8, 10 }.&lt;br /&gt;&lt;br /&gt;I'm not sure whether this approach is superior to a map-based approach; I think I'd have to code each one all the way to completion.  The vector does have a few advantages:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Compact storage, with minimal overhead.&lt;/li&gt;&lt;li&gt;Reading the sorted array can usually be done in linear or log-N time.&lt;/li&gt;&lt;/ul&gt;The basic rules for the interval set are:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Intervals are inclusive at the bottom and exclusive at the top.  So the interval 3,6 includes the number 3 but excludes the number 6.  (Thus given two intervals [3,6) and [6,9) the number 6 is included in exactly one of them.)&lt;/li&gt;&lt;li&gt;Intervals must have non-zero length (so [3,3) is illegal).&lt;/li&gt;&lt;li&gt;Intervals are finite - there is no notation to say "everything below 3" is included.&lt;/li&gt;&lt;/ul&gt;The heart of the algorithm is a "merge" operation.  In a merge, the sequence of interval edges from two interval sets are traversed together (think of a merge sort) and each new smallest sub-interval is evaluated for inclusion by a boolean operation.  
This lets us perform a union, difference, intersection, or symmetric difference in linear time O(N+M) where N and M are the lengths of the vectors. &lt;br /&gt;&lt;br /&gt;(The actual time is slightly worse because the vector will need to periodically reallocate its memory during the creation of the new resulting vector.  We could use a heuristic to pre-allocate some space at a loss of memory efficiency.  If we used a set we could avoid memory costs, but we'd end up with O(NlogN) time to build the set anyway, and we'd pay node overhead, which is almost certainly worse than any extra on a vector of 32-bit floats or integers.)&lt;br /&gt;&lt;br /&gt;When we search for an interval (using lower_bound) we can tell whether we are "in" or "out" of the region by looking at whether the index of the returned region is even or odd - even regions are in the set and odd ones are outside of it.&lt;br /&gt;&lt;br /&gt;The interval class is also heavily special cased for a number of optimizations:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Separate operators on pair&lt;t,t&gt; allow for the processing of a single interval (rather than a set).  When we know the single interval, we can take a number of short-cuts, and we can perform "in-place editing" using log-N searches into the original interval set.&lt;/li&gt;&lt;li&gt;Operations on sets can identify short cuts.  For example, the intersection of two sets whose range is disjoint is always empty.  (In other words, if the last value in A is less than the first value in B, intersecting A and B is an empty set.)&lt;/li&gt;&lt;/ul&gt;I haven't used the interval set class enough to profile it; real measurement will tell which of these optimizations is a win.  
One tricky aspect of the code is that vector is a leaky abstraction - it makes mid-vector insertion look cheap when really it is a linear operation (because all subsequent elements must be copied to their new locations).&lt;br /&gt;&lt;br /&gt;As an example of why this might matter: consider symmetric difference (XOR) of an interval set and a single range.  This operation can be computed simply by: deleting the range bounds from the set if they exist, otherwise inserting them.  In other words, given the interval set [0,3) [6,9) [12,15) we can XOR this with the interval [6,8) by deleting 6 and inserting 8 - the new XOR is [0,3) [8,9) [12,15).  This is a relatively fast operation: two log-N searches (for 6 and 8) and one delete and one insert.&lt;br /&gt;&lt;br /&gt;Despite the simplicity of the algorithm, vector is going to require two mid-vector editing operations, so our average time complexity is O(N) - linear!  (On average half the elements of the vector are after us, and we do two editing ops.)&lt;br /&gt;&lt;br /&gt;For this reason, a disjoint XOR is special cased.  If we XOR [-10, -8) into the above region, we can observe that -8 &lt; 0, therefore the regions don't intersect, and -10, -8 simply needs to be pre-pended.  
This can be done with a single insert, and thus should run about twice as fast as a pair of individual inserts.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-2657026844241838479?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/2657026844241838479/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/06/interval-sets-with-stl.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/2657026844241838479'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/2657026844241838479'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/06/interval-sets-with-stl.html' title='Interval Sets With the STL'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-6234988875259574441</id><published>2010-05-06T07:56:00.003-04:00</published><updated>2010-05-06T08:19:59.384-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='CGAL'/><title type='text'>Importing Faces Into CGAL Arrangements</title><content type='html'>This problem occurs repeatedly in the X-Plane scenery tools code: we need to import a series of polygons into a single arrangement_2 structure, and we want to tag the faces contained by these polygons as they appear in the arrangement.  
This isn't entirely trivial for two reasons:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The polygons may collide with each other and thus there may be a 1:many relationship between the original data and the final map.&lt;/li&gt;&lt;li&gt;The polygons may be self-intersected or in other ways "hosed", so we need a particular strategy for handling this situation.&lt;/li&gt;&lt;/ul&gt;CGAL provides a number of built-in tools to deal with these situations.  Here are 3 basic and useful building blocks:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;If you are using the general polygon set code that is built on top of arrangements, you can simply perform unions and intersections of a large number of faces.  For merging a large number of areas, this code is faster than anything else you might code, because it can do an N-way divide and conquer, where N is larger than 2.&lt;/li&gt;&lt;li&gt;For custom merging and handling of multiple polygons, you can use the overlay free function, which lets you specify how each combination of faces from the two maps is handled.&lt;/li&gt;&lt;li&gt;You can simply insert a set of curves into an empty arrangement and they will be "swept" together.  This is a useful way to turn a messy polygon into something useful - it finds intersections, builds topology, runs quickly, and handles input no matter how degenerate.&lt;/li&gt;&lt;/ol&gt;For example, if the goal is to have any area contained by any piece of polygon as "inside" you can simply insert all polygon sides into an empty arrangement and then tag every bounded face.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Finding Polygon Internal Areas&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;When building arrangements out of polygons of dubious origin (or simply building an arrangement out of a large number of unrelated polygons) I use a bulk insert to "sweep" the curves into the arrangement.  How do I then find the faces?   
Here are three techniques:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;The contained area of a polygon can be found by simply checking whether the face is bounded or not.  This is not useful though when importing multiple polygons at the same time.  (When I need this technique, each polygon is individually imported into its own arrangement, then all arrangements are merged later, typically with general-polygon-set code.)&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;We can implement a "toggle" policy (e.g. each line toggles interior vs. exterior) by doing a search from the outside to the inside of the arrangement, toggling whether we are "inside" or "outside" each time we cross a halfedge.  The halfedges can retain curve-based properties; typically I use a consolidated data curve so that halfedges retain every property attached to them.&lt;/p&gt;&lt;p&gt;One danger: an antenna will produce incorrect results in this technique because it won't toggle the data property twice.  This can be hard to work around because data from the insert is maintained per &lt;i&gt;edge&lt;/i&gt;, not half-edge.&lt;/p&gt;&lt;p&gt;Unfortunately, topology of the final arrangement doesn't help us resolve this either.  Imagine an antenna (in the original polygon) that crashes into another polygon, thus becoming part of a real partition.  Technically the original face is not split by the antenna, but the faces in the final arrangement are, so saying face()==twin()-&gt;face() doesn't tell us we have an antenna in the original.&lt;/p&gt;&lt;p&gt;Two ways to work around this: don't insert antennas, or don't tag known antennas with any data.  Both cases require knowing that we have an antenna ahead of time.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Sometimes we want to use a "winding rule" - that is, contain areas inside closed left turning contours.  
This is, for example, useful when calculating offset buffers and Minkowski sums; the artifacts from strange shapes being offset too much turn out not to be left turning contours and get thrown out.&lt;/p&gt;&lt;p&gt;To find the winding rule areas, we have to look at the direction of the curves inserted.&lt;/p&gt;&lt;p&gt;The way my code does this is to look at the direction of the underlying curve vs. the direction of the half-edge and then mark the half-edge as being on the "inside" or "outside" of the winding with a data field on the half-edge itself.  This is reconstructed after bulk insert, and then we can traverse the whole arrangement, counting windings.&lt;/p&gt;&lt;p&gt;The limitation here is similar to above: if we have an antenna, the underlying curve can have only one direction, not two, and one half-edge will be incorrectly tagged.  Fortunately antennas are not typically necessary to produce offset buffers.&lt;br /&gt;&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;It should be noted that if you insert curves incrementally (insert one curve into the arrangement) an observer of the arrangement returns all generated and overlapped half-edges, which gives you the contained bounds of the contour inserted.  I use this technique when inserting a low-side-count face into a very complex arrangement, to avoid re-sweeping a huge amount of data.  
Overlays and bulk inserts do not produce "per curve" announcements via the observer mechanism.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-6234988875259574441?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/6234988875259574441/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/05/importing-faces-into-cgal-arrangements.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/6234988875259574441'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/6234988875259574441'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/05/importing-faces-into-cgal-arrangements.html' title='Importing Faces Into CGAL Arrangements'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-7701239682925617989</id><published>2010-04-23T10:12:00.002-04:00</published><updated>2010-04-23T10:23:40.994-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='CGAL'/><category scheme='http://www.blogger.com/atom/ns#' term='Performance'/><title type='text'>CGAL: It's All About the Mantissa</title><content type='html'>In a past post I described CGAL as having &lt;a href="http://hacksoflife.blogspot.com/2009/04/why-cgal.html"&gt;no rounding errors&lt;/a&gt;.  
It does this by using number types of variable size (using dynamically allocated memory per number!) so that it never runs out of digits. (It also maintains the numerator and denominator of fractions separately to avoid problems with repeating decimals.)&lt;br /&gt;&lt;br /&gt;The advantage of this is that geometric algorithms that rely on precise calculations never go haywire due to rounding errors.  For example, when using fixed-precision math (e.g. IEEE floats) the intersection of two near-parallel lines will be calculated inaccurately - sometimes with the intersection showing up miles from the original lines.  CGAL always has more precision, so it avoids this problem.&lt;br /&gt;&lt;br /&gt;But there is one down-side: when you perform a series of intersections, the result is exact numbers whose mantissas (the number of actual digits) have grown very long.  And CGAL won't blink about making them even longer as you do more calculations.&lt;br /&gt;&lt;br /&gt;Instead CGAL will become insanely slow.&lt;br /&gt;&lt;br /&gt;I hit this case the other day.  The first piece of processing I do is to combine a whole pile of vector data from &lt;a href="http://www.openstreetmap.org/"&gt;OSM&lt;/a&gt; into one integrated map.  While OSM is not particularly high precision (from a bits standpoint) the resulting intersecting points are calculated "perfectly", sometimes with very large mantissas.&lt;br /&gt;&lt;br /&gt;I then wrote a piece of code to take a city block from that OSM map and perform some calculations to find the sidewalk location.  The problem: the four corners of the city block were already very long numbers since they were the result of a CGAL calculation.  Thus a long calculation on a long calculation becomes very slow.&lt;br /&gt;&lt;br /&gt;The original algorithm took about 36 minutes for a fully optimized build to find all sidewalks in San Diego.  
That is way too slow, and unusable for our project.&lt;br /&gt;&lt;br /&gt;I then put a rounding stage in: for each corner of the block, I would convert it to a regular 64-bit IEEE float and then back to CGAL, throwing out any "extra" precision that CGAL was saving.  Note that the 64-bit float already gives me better than 1 millimeter precision, which is more than overkill for a road.  The algorithm run on the "simplified" data ran in 67 seconds.&lt;br /&gt;&lt;br /&gt;Now there is one danger: if, due to mismatched road locations in OSM or conflicting edits, some of the "blocks" were really tiny (less than 1 mm) CGAL would have correctly built that block using infinite precision, and my "rounding" would have incorrectly reshaped those blocks, perhaps turning them inside out or in some other way damaging them.&lt;br /&gt;&lt;br /&gt;So a necessary step to productizing this 'resolution reduction' is to do a sanity check on each resulting block.  Fortunately most of the time if the block contains too-small-to-use data, we don't need the data in the first place.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-7701239682925617989?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/7701239682925617989/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/04/cgal-its-all-about-mantissa.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7701239682925617989'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7701239682925617989'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/04/cgal-its-all-about-mantissa.html' 
title='CGAL: It&apos;s All About the Mantissa'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-1742267072158065172</id><published>2010-04-21T12:46:00.004-04:00</published><updated>2010-04-21T12:51:17.759-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Quotes'/><title type='text'>Constitutional Opposition</title><content type='html'>One part of &lt;a href="http://daringfireball.net/2010/04/why_apple_changed_section_331"&gt;this post by Daring Fireball&lt;/a&gt; on the iPhone SDK licensing agreement made me chuckle:&lt;br /&gt;&lt;blockquote&gt;If you are constitutionally opposed to developing for a platform where  you’re expected to follow the advice of the platform vendor, the iPhone  OS is not the platform for you. It never was. 
It never will be.&lt;br /&gt;&lt;/blockquote&gt;It inspired me to come up with a new quotable:&lt;br /&gt;&lt;blockquote&gt;If you are constitutionally opposed to developing for a platform where  you’re expected to follow the advice of the platform vendor, you should not be a computer programmer.&lt;br /&gt;&lt;/blockquote&gt;See also basically every post by &lt;a href="http://blogs.msdn.com/oldnewthing/"&gt;Raymond Chen&lt;/a&gt;: "just because, in Win98SE2, you could call SomeRandomWin32API with a combination of NULL, -1, and Bill Gates's IQ and get an undocumented behavior that violates all of Microsoft's guidelines for applications development doesn't mean it will continue to work in Windows 7."&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-1742267072158065172?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/1742267072158065172/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/04/constitutional-opposition.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/1742267072158065172'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/1742267072158065172'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/04/constitutional-opposition.html' title='Constitutional Opposition'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-2588042313781128750</id><published>2010-04-21T10:45:00.002-04:00</published><updated>2010-04-21T10:52:40.130-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='c++'/><title type='text'>Thank You Jeeves, That Will Be All</title><content type='html'>The other day I went in to discover why a new piece of scenery code had mysteriously stopped working.  Eventually I came to this:&lt;br /&gt;&lt;blockquote&gt; (p,path.size()/2,def,degree,inExtrudeFunc,&lt;br /&gt;inObjectFunc,inChecker,ag_mode_draped_obj);&lt;/blockquote&gt;Ah!  Now it all makes sense.  The code should have read:&lt;br /&gt;&lt;blockquote&gt;AG_extrude_string(p,path.size()/2,def,degree,inExtrudeFunc,&lt;br /&gt;inObjectFunc,inChecker,ag_mode_draped_obj);&lt;/blockquote&gt;After having done a global search, clearly I had hit the space bar by accident, nuking my function call.  The charming thing is that C++ doesn't question why I have a giant list of parenthetical "stuff", it just blissfully compiles it into an expression that does...well, pretty much nothing.&lt;br /&gt;&lt;br /&gt;Some of my other favorite C++-isms:&lt;br /&gt;&lt;blockquote&gt;case a: do_it(); break;&lt;br /&gt;b: do_x(); break; // no case, not illegal - now "b" is a label!&lt;br /&gt;defaultl: do_more(); break; // typo in default?  That's a label too!&lt;br /&gt;&lt;/blockquote&gt;Of course we are all familiar with the fun that emerges from swapping = and ==.  And having a stray semi-colon never hurt anything.&lt;br /&gt;&lt;br /&gt;Propsman had an apt characterization: C++ is like an overly polite butler. "A...hamburger on the rocks, Sir?  
Certainly, Sir, I'll bring you one directly..."&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-2588042313781128750?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/2588042313781128750/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/04/thank-you-jeeves-that-will-be-all.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/2588042313781128750'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/2588042313781128750'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/04/thank-you-jeeves-that-will-be-all.html' title='Thank You Jeeves, That Will Be All'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-923968653690880138</id><published>2010-03-16T13:06:00.005-04:00</published><updated>2010-03-18T14:49:57.088-04:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>Santa to Ben: You're An Idiot</title><content type='html'>A few months ago I &lt;a href="http://hacksoflife.blogspot.com/2009/12/all-i-want-for-xmas-is-parallel-command.html"&gt;posted a request&lt;/a&gt; (to Santa) for parallel command dispatch.  
The idea is simple: if I am going to render several CSM shadow map levels and the scene graph contained in each does not overlap, then each one is both (1) independent in render target and (2) independent in the actual work being done.  Because the stuff being rendered to each shadow map is different, using geometry shaders and multi-layer FBOs doesn't help.* My idea was: well I have 8 cores on the CPU - if the GPU could slurp down and run 8 streams, it'd be like having 8 independent GLs running at once and I'd get through my shadow prep 8x as fast.&lt;br /&gt;&lt;br /&gt;I was talking with a CUDA developer and finally I got a clue.  The question at hand was whether CUDA actually runs parallel kernels.  Her comment was that while you can queue multiple kernels asynchronously, the goal of such a technique is to keep the GPU busy - that is, to keep the GPU from going idle between batches of kernel processing.  The technique of multiple kernels isn't necessary to keep the GPU fully busy, because even with hundreds of shader units, the kernel is going to run over thousands or tens of thousands of data points.  That is, CUDA is intended for wildly parallel processing, so the entire swarm of "cores" (or "shaders"?) is still smaller than the number of units of work in a batch.&lt;br /&gt;&lt;br /&gt;If you submit a tiny batch (only 50 items to work over) there's a much bigger problem than keeping the GPU hardware busy - the overhead of talking to the GPU at all is going to be worse than the benefit of using the GPU.  For small numbers of items, the CPU is a better bet - it has better locality to the rest of your program!&lt;br /&gt;&lt;br /&gt;So I thought about that, then turned around to OpenGL and promptly went "man am I an idiot".  
Consider a really trivial case: we're preparing an environment map, it's small (256 x 256) and the shaders have been radically reduced in complexity because the environment map is going to be only indirectly shown to the user.&lt;br /&gt;&lt;br /&gt;That's still at least 65,536 pixels to get worked over (assuming we don't have over-draw, which we do).  Even on our insane 500-shader modern day cards, the number of shaders is still much smaller than the amount of fill we have to do.  The entire card will be busy - just for a very short time.  (In other words, graphics are still &lt;i&gt;embarrassingly&lt;/i&gt; parallel.)&lt;br /&gt;&lt;br /&gt;So, at least on a one GPU card, there's really no need for parallel dispatch - serial dispatch will still keep the hardware busy.&lt;br /&gt;&lt;br /&gt;So...parallel command dispatch?  Um...never mind.&lt;br /&gt;&lt;br /&gt;This does raise the question (which I have not been able to answer with experimentation): if I use multiple contexts to queue up multiple command queues to the GPU using multiple cores (thus "threading the driver myself") will I get faster command-buffer fill and thus help keep the card busy?  This assumes that the card is going idle when it performs trivially simple batches that require a fair amount of setup. &lt;br /&gt;&lt;br /&gt;To be determined: is the cost of a batch in driver overhead (time spent deciding &lt;i&gt;whether&lt;/i&gt; we need to change the card configuration) or &lt;i&gt;real&lt;/i&gt; overhead (e.g. we have to switch programs and the GPU isn't that fast at it)?  It can be very hard to tell from an app standpoint where the real cost of a batch lives.&lt;br /&gt;&lt;br /&gt;Thanks to those who haunt the OpenGL forums for smacking me around^H^H^H^H^H^H^H^H^Hsetting me straight re: parallel dispatch.&lt;br /&gt;&lt;br /&gt;* geometry shaders and multilayer FBO help, in theory, when the batches and geometry for each rendering layer are the same.  
But for a cube map if most of the scene is not visible from each cube face, then the work for each cube face is disjoint and we are simply running our scene graph, except now we're going through the slower geometry shader vertex path.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-923968653690880138?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/923968653690880138/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/03/santa-to-ben-youre-idiot.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/923968653690880138'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/923968653690880138'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/03/santa-to-ben-youre-idiot.html' title='Santa to Ben: You&apos;re An Idiot'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-4707434202256388417</id><published>2010-03-10T15:47:00.002-05:00</published><updated>2010-03-10T16:39:51.747-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>The Value Of Granularity</title><content type='html'>OpenGL is a very &lt;a href="http://www.joelonsoftware.com/articles/LeakyAbstractions.html"&gt;leaky abstraction&lt;/a&gt;.  It promises to draw in 3-d.  And it does!  
But it doesn't say a lot about how long that drawing will take, yet performance is central to GL-based games and apps.  Filling in this gap is transient information about OpenGL and its current dominant implementations that isn't easy to come by - it comes from a mix of insight from other developers, connecting the dots, reading many disparate documents, and direct experimentation.  This isn't easy for someone who isn't working full time as an OpenGL developer, so I figure there may be some value to blogging things I have learned the hard way about OpenGL while working on X-Plane.&lt;br /&gt;&lt;br /&gt;OpenGL presents new functionality via extensions.  (It also presents new functionality via version numbers, but the extensions tend to range ahead of the version numbers because the version number can only be bumped when &lt;i&gt;all&lt;/i&gt; required extensions are available.)  When building an OpenGL game you need a strategy for coping with different hardware with different capabilities.  X-Plane dates back well over a decade, and has been using OpenGL for a while, so the app has had to cope with pretty much every extension being not available at one point or another.&lt;br /&gt;&lt;br /&gt;Our overall strategy is to categorize hardware into "buckets".  
For X-Plane 9 we have 2.5 buckets:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Pre-shader hardware, running on a fixed function pipeline.&lt;/li&gt;&lt;li&gt;Modern shader enabled hardware, using shaders whenever possible.&lt;/li&gt;&lt;li&gt;We have a few shaders that get cased off into a special bucket for the first-gen shader hardware (R300, NV25), since that hardware has some performance and capability limitations.&lt;/li&gt;&lt;/ul&gt;These buckets then get sliced up by features the user selects, but these don't complicate the buckets - we simply make sure we can shade without per pixel lighting, for example, if the user wants higher framerate.&lt;br /&gt;&lt;br /&gt;So here is what has turned out to be surprising: we were basically forced to allow X-Plane to run with a very granular set of extensions for debugging purposes.  An example will illustrate.&lt;br /&gt;&lt;br /&gt;Using the buckets strategy you might say: "The shader bucket uses GLSL, FBOs, and VBOs.  Any hardware in that category has all three, so don't write any code that uses GLSL but not FBOs, or GLSL but not VBOs."  The idea is to save coding by reducing the combination of all possible OpenGL hardware (we have eight combos of these three extensions) to only two combinations (have them all, don't have them all).&lt;br /&gt;&lt;br /&gt;What we found in practice was that being able to run in a semi-useful state without FBOs but with GLSL was immensely useful for in-field debugging.  This is not a configuration we'd ever want to really support or use, but at least during the time period that we started using FBOs heavily, the driver support for them was spotty on the configurations we hit in-field.  
Being able to tell a user to run with --no_fbos was an invaluable differential to demonstrate that a crash or corrupt screen was related specifically to FBOs and not some other part of OpenGL.&lt;br /&gt;&lt;br /&gt;As a result, X-Plane 9 can run with any of these "core" extensions in an optional mode: FBOs, GLSL, VBOs (!), PBOs, point sprites, occlusion queries, and threaded OpenGL.  That list matches a series of driver problems we ran across pretty directly.&lt;br /&gt;&lt;br /&gt;Maintaining a code base that supports virtually every combination is not sustainable indefinitely, and in fact we've started to "roll up" some of these extensions.  For example, X-Plane 9.45 requires a threaded OpenGL driver, whereas X-Plane 9.0 would run without it.  We remove support for individual extensions going missing when tech support calls indicate that "in field" the extension is now reliable.&lt;br /&gt;&lt;br /&gt;At this point it looks like FBOs, threaded OpenGL, and VBOs are pretty much stable.  
But I believe that as we move forward into newer, weirder OpenGL extensions, we will need to keep another set of extensions optional on a per-feature basis as we find out the hard way what isn't stable in-field.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-4707434202256388417?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/4707434202256388417/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/03/value-of-granularity.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/4707434202256388417'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/4707434202256388417'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/03/value-of-granularity.html' title='The Value Of Granularity'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-6830172400431123934</id><published>2010-02-28T21:05:00.000-05:00</published><updated>2010-02-28T21:05:00.242-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>One More On VBOs - glBufferSubData</title><content type='html'>So if you survived the &lt;a href="http://hacksoflife.blogspot.com/2010/02/double-buffering-part-2-why-agp-might.html"&gt;timing of VBO updates&lt;/a&gt; (or rather, my speculations on what is possible with VBO 
updates), now you're in a position to ask the question: how fast might glBufferSubData be?  In particular, developers like myself are often astonished when glBufferSubData does things like block.&lt;br /&gt;&lt;br /&gt;In a world before manual synchronizing of VBOs (via the 3.0 buffer management APIs or Apple's buffer range extensions) we can now see why a sub-data buffer on a streamed VBO might perform quite badly.&lt;br /&gt;&lt;br /&gt;The naive code goes something like this:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Fill half the buffer with buffer sub-data.&lt;/li&gt;&lt;li&gt;Issue a draw call to that half of the buffer.&lt;/li&gt;&lt;li&gt;Flip which half of the buffer we are using and go back to step 1.&lt;/li&gt;&lt;/ol&gt;In other words, double buffering by dividing the buffer in half, or treating it like a ring buffer.&lt;br /&gt;&lt;br /&gt;This implementation is going to perform terribly.  The sub-data call is going to block until the previous draw call has completed, even though they use opposite halves of the buffer, and we'll lose all of our concurrency.  Let's see if we can understand why.&lt;br /&gt;&lt;br /&gt;If we go to respecify a VBO in AGP memory using glBufferSubData while a draw from that VBO is still in progress, glBufferSubData must block; it can't rewrite the buffer until the last draw finishes because we would see the new vertices, not the old, or maybe half and half.  In order for the "fill" to complete, the driver would have to be able to determine that the pending draws and the new fill are completely disjoint.&lt;br /&gt;&lt;br /&gt;There are two reasons why the driver might not be able to figure this out:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;You've drawn using glDrawElements, and thus the actual part of the vertex VBO you draw from is determined by the index table.  The cost of figuring out the "extent" of this draw is to process all of the indices. The cure is worse than the disease.  
Any sane driver is going to simply assume that &lt;i&gt;any&lt;/i&gt; part of the VBO could be used.&lt;/li&gt;&lt;li&gt;Let's assume you use glDrawRangeElements to tell the driver that you're really only going to use half the VBO.  Even then, the structure to mark "locked" regions would be a complex one - a series of draws over overlapping regions would require a complex data structure.  For this one special case, you're asking the drivers to replace a simple time-stamp based lock (e.g. this VBO is locked until this many commands have executed) with a dynamic range marking structure.  If I were a driver writer I'd say "let's keep it simple and not eat this cost on all VBOs."&lt;/li&gt;&lt;/ol&gt;I think it's safe to assume that some implementations (and all if you use glDrawElements) are simply going to mark the entire VBO as in use until the draw happens, and thus the partial rewrite is going to block as if there was a conflict, even if there was not.&lt;br /&gt;&lt;br /&gt;Can we do anything about this?  Besides falling back to an "orphaned" approach where we get a fresh buffer each time, our alternative is to use the more exact APIs from &lt;a href="http://www.opengl.org/registry/specs/ARB/map_buffer_range.txt"&gt;ARB_map_buffer_range&lt;/a&gt; or &lt;a href="http://www.opengl.org/registry/specs/APPLE/flush_buffer_range.txt"&gt;APPLE_flush_buffer_range&lt;/a&gt;.  With these APIs we can map only the part of the VBO we know is not in use, with the unsynchronized bit set to avoid blocking because the other half is pending. We can use flush explicit to then flush only the areas we modified.  (With the 3.0 APIs we can also use the discard range option to simply say "we are rewriting what we map".)&lt;br /&gt;&lt;br /&gt;Of course, this technique isn't without peril - all synchronization is up to the client.  
The main danger is an over-run: your app is so fast that it needs to modify a range that the GL isn't done with - we made it all the way around our ring buffer.  Probably the safest way to cope with this is to put explicit fences in place to wait until the last dependent draw call that we issued is finished.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-6830172400431123934?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/6830172400431123934/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/one-more-on-vbos-glbuffersubdata.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/6830172400431123934'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/6830172400431123934'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/one-more-on-vbos-glbuffersubdata.html' title='One More On VBOs - glBufferSubData'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-7395836405818847975</id><published>2010-02-28T11:16:00.004-05:00</published><updated>2010-02-28T13:30:08.224-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>Double-Buffering Part 2 - Why AGP Might Be Your Friend</title><content type='html'>&lt;a 
href="http://hacksoflife.blogspot.com/2010/02/double-buffering-vbos.html"&gt;In my previous post&lt;/a&gt; I suggested that to get high VBO vertex performance in OpenGL, it's important to decouple pushing the next set of vertices from the GPU processing the existing ones.  A naively written program will block when sending the next set of vertices until the last one goes down the pipe, but if we're clever and either orphan the buffer or use the right flags, we can avoid the block.&lt;br /&gt;&lt;br /&gt;(My understanding is that orphaning actually gets you a second buffer, in the case where you want to double the entire buffer.  With manual synchronization we can simply be very careful and use half the buffer each frame.  &lt;i&gt;Very careful.&lt;/i&gt;)&lt;br /&gt;&lt;br /&gt;Now I'm normally a big fan of geometry in VRAM because it is, to put it the Boston way, "wicked fast".  And perhaps it's my multimedia background popping up, but to me a nice GPU-driven DMA seems like the best way to get data to the card.  So I've been trying to wrap my head around the question: why not double-buffer into VRAM?  This analysis is going to get into the highly speculative - the true answer I think is "the devil is in the details, and the details are in the driver", but at least we'll see that the issue is very complex, double-buffering into VRAM has a lot of things that could go wrong, so we should not be surprised if when we tell OpenGL that we intend to stream our data it gives us AGP memory instead.*&lt;br /&gt;&lt;br /&gt;Before we look at the timing properties of an application using AGP memory or VRAM, let's consider how modern OpenGL implementations work: they "run behind".  By this I mean: you ask OpenGL to draw something, and some time later OpenGL actually gets around to doing it.  How much behind?  Quite possibly a lot.  The card can run behind at least an entire frame, depending on implementation, maybe two.  
You can keep telling the GPU to do more stuff until:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;You hit some implementation defined limit (e.g. you get 2 full frames ahead and the GPU says "enough!").  Your app blocks in the swap-backbuffer windowing system call.&lt;/li&gt;&lt;li&gt;You run out of memory to build up that outstanding "todo" list.  (Your app blocks inside the GL driver waiting for command buffers - the memory used to build the todo list.)&lt;/li&gt;&lt;li&gt;You ask OpenGL about something it did, but it hasn't done it. (E.g. you try to read an occlusion query that hasn't finished and block in the "get" call.)&lt;/li&gt;&lt;li&gt;You ask to take a lock on a resource that is still pending for draw.  (E.g. you do a glMapBuffer on a non-orphaned VBO with outstanding draws, and you haven't disabled sync with one of the previously mentioned extensions.)&lt;/li&gt;&lt;/ol&gt;There may be others, but I haven't run into them yet.&lt;br /&gt;&lt;br /&gt;Having OpenGL "run behind" is a good thing for your application's performance.  You can think of your application and the GPU as a reader-writer problem.  In multimedia, our top concern would be underruns - if we don't "feed the beast" enough audio by a deadline, the user hears the audio stop and calls tech support to complain that their expensive ProTools rig is a piece of junk.  With an OpenGL app, underruns (the GPU got bored) and overruns (the app can't submit more data) aren't fatal, but they do mean that one of your two resources (GPU and CPU) is not being fully used.  The &lt;i&gt;longer&lt;/i&gt; the length of the FIFO (that is, the more OpenGL can run behind without an overrun) the more flexibility we have to have the speed of the CPU (requesting commands) and the GPU (running the commands) be mismatched for short periods of time.&lt;br /&gt;&lt;br /&gt;An example: the first thing you do is draw a planet - it's one VBO, the app can issue the command in just one call.  Very fast!  
But the planet has an expensive shader, uses a ton of texture memory, and fills the entire screen.  That command is going to take a little time for the GPU to finish.  The GPU is now "behind."  Next you go to draw the houses.  The houses sit in a data structure that has to be traversed to figure out which houses are actually in view.  This takes some CPU time, and thus it takes a while to push those commands to the GPU.  If the GPU is still working on the planet, then by the time the GPU finishes the planet, the draw-house commands are ready, and the GPU moves seamlessly from one task to the other without ever going idle.&lt;br /&gt;&lt;br /&gt;So we know we want the GPU to be able to run behind and we don't want to wait for it to be done.  How well does this work with the previous post's double-buffer scheme?  It works pretty well.  Each draw has two parts: a "fill" operation done on the CPU (map orphaned buffer, write into AGP memory, unmap) and a later "draw" operation on the GPU.  Each one requires a lock on the buffer actually being used.  If we can have two buffers underneath our VBO (some implementations may allow more - I don't know) then:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The fill operation on frame 3 will wait for the draw operation on frame 1.&lt;/li&gt;&lt;li&gt;The fill operation on frame 4 will wait for the draw operation on frame 2.&lt;/li&gt;&lt;li&gt;The draw operation on frame N always waits for the fill operation (of course).&lt;/li&gt;&lt;/ul&gt;This means we can issue up to two full frames of vertices.  On the third frame (if frame one is &lt;i&gt;still&lt;/i&gt; not finished) only then might we block.  That's good enough for me.&lt;br /&gt;&lt;br /&gt;If the buffer is going to be drawn from VRAM, things get trickier.  We now have three steps:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;"fill" the system RAM copy.  Fill 2 waits on DMA 1.&lt;/li&gt;&lt;li&gt;"DMA" the copy from system RAM to VRAM.  
DMA 2 waits on fill 2 and draw 1.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;"draw" the copy from VRAM.  Draw 1 waits on DMA 1.&lt;/li&gt;&lt;/ul&gt;Now we can start to see why the timing might be worse if our data is copied to VRAM.  That DMA transfer is going to have to happen after the last draw (so the VRAM buffer is available) and before the next fill (because we can't fill until the data has been safely copied).  It is "sandwiched" and it makes our timing a lot tighter.&lt;br /&gt;&lt;br /&gt;Consider the case where the DMA happens right after we finish filling the buffer.  In this case, the DMA is going to block on the last draw not completing - we can't specify frame 2 until frame 1 draw is mostly done.  That's bad.&lt;br /&gt;&lt;br /&gt;What about the case where the DMA happens really late, right before the draw actually happens?  Filling buffer 2 is going to block taking a lock until the previous frame 1 DMA completes.  That's bad too!&lt;br /&gt;&lt;br /&gt;I believe that there is a timing that isn't as bad as these cases though: if the OpenGL driver can schedule the DMA as early as possible once the card is done with the last draw, the DMA ends up with timing somewhere in between these two cases, moving around depending on the actual relationship between GPU and CPU speed.&lt;br /&gt;&lt;br /&gt;At a minimum I'd summarize the problem like this: since the DMA requires both of our buffers (VRAM and system) to be available at the same time, the DMA has to be timed just right to keep from blocking the CPU.  By comparison, a double-buffered AGP strategy simply requires locking the buffers.&lt;br /&gt;&lt;br /&gt;To complete this very drawn out discussion: why would we even want to stream out of VRAM? As was correctly pointed out on the OpenGL list, this strategy requires an extra copy of the data - our app writes it, the DMA engine copies it, then the GPU reads it.  (With AGP, the GPU reads what we write.)  
The most compelling case that I could think of, the one that got me thinking about this, is the case where the streaming ratio isn't 1:1.  We specify our data per frame, but we make multiple rendering passes per frame.  Thus we draw our VBO perhaps 2 or 3 times for each rewrite of the vertices, and we'd like to use up bus bandwidth only once.  A number of common algorithms (environment mapping, shadow mapping, early Z-fill) all run over the scene graph multiple times, often with the assumption that geometry is cheap (which mostly it is).&lt;br /&gt;&lt;br /&gt;But this whole post has been pretty much entirely speculative.  All we can do is clearly signal our intentions to the driver (are we a static, stream, or dynamic draw VBO) and orphan our buffers and hope the driver can find a way to keep giving us buffers rapidly without blocking, while getting our geometry up as fast as possible.&lt;br /&gt;&lt;br /&gt;* We might want to assume this and then be careful about how we write our buffer-fill code so that it is efficient in uncached write-combined memory: we want to fill the buffer linearly in big writes and not read or muck around with it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-7395836405818847975?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/7395836405818847975/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/double-buffering-part-2-why-agp-might.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7395836405818847975'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7395836405818847975'/><link rel='alternate' type='text/html' 
href='http://hacksoflife.blogspot.com/2010/02/double-buffering-part-2-why-agp-might.html' title='Double-Buffering Part 2 - Why AGP Might Be Your Friend'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-7029028865508708854</id><published>2010-02-24T13:56:00.008-05:00</published><updated>2010-02-24T19:43:55.385-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><category scheme='http://www.blogger.com/atom/ns#' term='Performance'/><title type='text'>Double-Buffering VBOs</title><content type='html'>One of the tricky aspects of the OpenGL API is that it specifies what an implementation will do, but it doesn't specify how fast it will do it. Plenty of forum posts are dedicated to OpenGL applications developers trying to figure out what the "fast path" is (e.g. what brew of calls will make it through the implementation in the least amount of time).  ATI and NVidia, for their part, drop hints in a number of places as to what might be fast, but sadly they don't have enough engineers to simply teach every one of us, one on one, how to make our apps less atrocious.&lt;br /&gt;&lt;br /&gt;One more bit of background: I don't know &lt;i&gt;squat&lt;/i&gt; about Direct3D.  I have never worked on Direct3D applications code, I have never used the API, and I couldn't even list all of the classes.  I only became aware of D3D's locking APIs recently when I found some comparisons between OGL and D3D when it comes to buffer management.  
So whatever I say about D3D, just assume it's wrong in subtle ways that are important but hard to detect.&lt;br /&gt;&lt;br /&gt;If you only want to draw a mesh, but never change it, life is easy.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Create a static-draw VBO.&lt;/li&gt;&lt;li&gt;Fill it with geometric goodness with glMapBuffer or glBufferData.&lt;/li&gt;&lt;li&gt;Draw it many times.&lt;/li&gt;&lt;li&gt;Hilarity ensues.&lt;/li&gt;&lt;/ol&gt;Things become more tricky if your VBO has to change per frame.  First there's the obvious cost: you're going to burn some host-to-graphics-card bandwidth, because the new geometry has to go to the card &lt;i&gt;every frame&lt;/i&gt;.  So you do some math and realize that PCIe buses are really quite fast and this is a non-issue.  Yet the actual performance isn't that fast.&lt;br /&gt;&lt;br /&gt;The non-obvious cost is synchronization.  When you map your buffer to place the new vertices using glMapBuffer, you're effectively waiting on a mutex that can be owned by you or the GPU - the GPU will keep that lock from when you issue the draw call until the draw call completes.  If the GPU is 'running behind' (that is, commands are completing significantly later than they are issued) you'll block on the lock.&lt;br /&gt;&lt;br /&gt;Why is there a lock that we can block on?  Well, there are basically two cases:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The "AGP" case: your VBO lives in system memory and is visible to the GPU via the GART.  That is, it is mapped into the GPU and CPU's space.  In this case, there is only one buffer, and changing the buffer on the CPU will potentially change the buffer before the draw happens on the GPU.  In this case we really do have to block.&lt;/li&gt;&lt;li&gt;The "VRAM" case: your VBO lives in both system memory and VRAM - the system memory is a backup/master copy, and the VRAM copy is a cached copy for speed.  
(This is like a "managed" resource in D3D, if I haven't completely misinterpreted the D3D docs, which I probably have.)&lt;/li&gt;&lt;/ol&gt;In this second case, you might think that because the old data is in VRAM, you should be able to grab a lock on the system memory to begin creating the new data without blocking.  This rapidly goes from the domain of "what can we observe about GL behavior" to "what do we imagine those wacky driver writers are doing under there". The short version is: that might be true sometimes, other times it's definitely not going to be true, it's going to very much depend on how the driver is structured, etc. etc.  The long version is long enough to warrant a separate post.&lt;br /&gt;&lt;br /&gt;D3D works around this with D3DLOCK_DISCARD.  This tells the driver that you want to completely rebuild the buffer.  The driver then hands you a possibly unrelated piece of memory to fill in, rather than waiting for the real buffer to be available for locking.  The driver makes a note that when the real draw operation is done, the buffer's "live" copy is now free to be reused, and the newly specified buffer is the "live" copy.  (This is, of course, classic double-buffering.)&lt;br /&gt;&lt;br /&gt;You can achieve the same effect in OpenGL using one of two techniques:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;If you have OpenGL 3.0 or &lt;a href="http://www.opengl.org/registry/specs/ARB/map_buffer_range.txt"&gt;GL_ARB_map_buffer_range&lt;/a&gt; you can use the flag GL_MAP_INVALIDATE_BUFFER_BIT on your glMapBufferRange call to signal that the old data can be discarded after GPU usage.&lt;/li&gt;&lt;li&gt;You can simply do a glBufferData with NULL as a base pointer before you map.  Since the contents of the buffer are now undefined, the implementation is free to pull the double-buffering optimization.  
(See the discussion of DiscardAndMapBuffer in the &lt;a href="http://www.opengl.org/registry/specs/ARB/vertex_buffer_object.txt"&gt;VBO extension spec&lt;/a&gt;.)&lt;/li&gt;&lt;/ul&gt;If you develop on a Mac, you can see all of this pretty easily in Shark.  If you map a buffer that you've rendered to without first "orphaning" it with glBufferData, you'll see (in a "time profile - all thread states" profile that captures thread blocking time) a lot of time spent in glMapBuffer, with a bunch of calls to internal functions that appear to "wait for time stamp" or "wait for finish object" or something else that sort of seems like it might be waiting.  This is your thread waiting for the GPU to say it's done with the buffer.  Orphan the buffer first, and the blockage goes away.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-7029028865508708854?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/7029028865508708854/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/double-buffering-vbos.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7029028865508708854'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7029028865508708854'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/double-buffering-vbos.html' title='Double-Buffering VBOs'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-3829533617520952338</id><published>2010-02-18T18:37:00.004-05:00</published><updated>2010-02-18T19:27:13.040-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>Alpha Blending, Back To Front, Front To Back</title><content type='html'>I was reading NVidia's white paper on &lt;a href="http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/smokeParticles/doc/smokeParticles.pdf"&gt;smoke particles&lt;/a&gt; and came across the notion of front-to-back blending.  The idea is to change OpenGL's blend equation so that you can start at the front and blend in behind translucent geometry.&lt;br /&gt;&lt;br /&gt;To blend front to back, you must have a destination surface that has an alpha channel, because the surface alpha channel remembers how much the next layer "shows through" the closer layer already put down.&lt;br /&gt;&lt;br /&gt;To render front to back, we need to do three unusual things:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Init our background to all black, all translucent (0,0,0,0).&lt;br /&gt;&lt;/li&gt;&lt;li&gt;We set a blend function of GL_ONE_MINUS_DST_ALPHA, GL_ONE.  This means that the new layer is dimmed to be the remainder of the opacity already put down.&lt;/li&gt;&lt;li&gt;We need to pre-multiply our fragment's RGB by its alpha, because this isn't being done by the alpha blender anymore.&lt;/li&gt;&lt;/ol&gt;One of the fun side effects of front-to-back transparency is that the final alpha channel in our surface is the correct alpha to draw our composited layers over another scene.&lt;br /&gt;&lt;br /&gt;One down side of front to back is that we can't use it on top of an existing scene unless the existing scene has an alpha channel that is set to clear.  
(This is usually not what you'd find after rendering.)&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Compositing&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;If you then want to put the front-to-back mixed layers on top of another layer, you need to use a blend function of GL_ONE, GL_ONE_MINUS_SRC_ALPHA.  Why?  Well, since we rendered over black, our mix is "pre-multiplied" by its alpha value - that is, more transparent areas are darker.  So we disable the alpha multiply.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Back To Front Revisited&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;If we render back to front, we can use GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA, and not premultiply in the shader.  There's just one problem: alpha poke-through. Basically if you layer four polygons on top of each other, each with 50% opacity, the end result will be very close to 50% opacity, but the correct result should be 1-0.5^4, or 93.75% opaque.  So with "standard" back-to-front opacity we can't later blit our accumulated texture.&lt;br /&gt;&lt;br /&gt;It turns out we can work around this with some GL voodoo:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Init the background to black opaque (0,0,0,1).&lt;/li&gt;&lt;li&gt;Set the blend function to GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA for color but use GL_ZERO, GL_ONE_MINUS_SRC_ALPHA for the alpha coefficients.  This requires glBlendFuncSeparate and GL 1.4.&lt;/li&gt;&lt;li&gt;When it comes time to mix down, use GL_ONE, GL_SRC_ALPHA.&lt;/li&gt;&lt;/ol&gt;Um...what?&lt;br /&gt;&lt;br /&gt;Here's what's going on: we need to use multiplication to "accumulate" opacity.  But since multiplication tends to move colors toward zero, and zero is transparent, multiplying our fragments' alphas together tends to make things more transparent.  So this scheme is based on treating 1.0 as transparent and 0.0 as opaque.  
Let's review those steps:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Since 1 is now transparent, we init our buffer to alpha=1 for transparency.&lt;/li&gt;&lt;li&gt;By using alpha coefficients of GL_ZERO, GL_ONE_MINUS_SRC_ALPHA, we are multiplying the destination alpha by the source alpha.  So here we have our "multiplying" to build up opacity.  By using GL_ONE_MINUS_SRC_ALPHA we invert our alpha - the fragment outputs 0 = transparent and this converts it to 1 = transparent.  The existing alpha in the framebuffer is already inverted.&lt;/li&gt;&lt;li&gt;When we go to actually composite, we use GL_SRC_ALPHA instead of GL_ONE_MINUS_SRC_ALPHA because our source alpha is already inverted.  (The source factor is GL_ONE because, like all pre-made blend mixes, we are pre-multiplied.)&lt;/li&gt;&lt;/ol&gt;It took me a little bit of head scratching to realize that the blend equation (a*b+c*d) can be used as a multiply instead of an add.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-3829533617520952338?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/3829533617520952338/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/alpha-blending-back-to-front-front-to.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3829533617520952338'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3829533617520952338'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/alpha-blending-back-to-front-front-to.html' title='Alpha Blending, Back To Front, Front To Back'/><author><name>Benjamin 
Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-4919494548913020569</id><published>2010-02-12T18:41:00.002-05:00</published><updated>2010-02-12T18:43:05.725-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Email'/><title type='text'>Multipart MIME and Apple Mail</title><content type='html'>I finally figured out why attachments from our bug report script don't have icons in Apple Mail: Apple Mail requires multipart/mixed as the MIME type, while Thunderbird will accept multipart/related. &lt;br /&gt;&lt;br /&gt;Apple Mail also cares about Content-Disposition; it will show an icon for "attachment"-style disposition, even for text files, but it will show the text (with no markings showing it is an attachment) for "inline" style.  
Thunderbird shows the full text, with horizontal rules, no matter what.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-4919494548913020569?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/4919494548913020569/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/multipart-mime-and-apple-mail.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/4919494548913020569'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/4919494548913020569'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/multipart-mime-and-apple-mail.html' title='Multipart MIME and Apple Mail'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-4678969904180308389</id><published>2010-02-10T12:00:00.003-05:00</published><updated>2010-02-10T12:14:36.761-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='GLSL'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>How To Change Your UV Map on the Fly</title><content type='html'>I've been playing with "stupid UV map tricks" lately - the  basic idea is to (in the fragment shader) change the texture coordinates before fetch.  
For example, given a texture divided into equally useful grid squares, we can on a per-grid square basis change which square we're in, to make the texture repetition less obvious.  Why do this?&lt;br /&gt;&lt;ul&gt;&lt;li&gt;You can make your textures look less repetitive without making meshes more complex.&lt;/li&gt;&lt;li&gt;Since the effect is in-shader, it can be turned off on lower end machines - scalability!&lt;/li&gt;&lt;/ul&gt;But there's this one bit of fine print, and it escaped me for about four months: if you want to "swizzle" the UV map in a discontinuous way (e.g. using "fract", "mod", etc.) you need to use the explicit gradient texture fetch functions!  If you don't, you get artifacts at the discontinuities.&lt;br /&gt;&lt;br /&gt;Huh?!?!&lt;br /&gt;&lt;br /&gt;In order to understand why this is necessary, you first have to understand how the hardware selects a mipmap level, and to understand that you have to understand how OpenGL generates derivatives.&lt;br /&gt;&lt;br /&gt;First the derivatives.  Most of the video cards I know about generate derivatives of a shader variable by "cross-differencing" - that is, a 2x2 block of pixels is run using the same shader, and when the shader hardware gets to the derivative (dFdx and dFdy) it simply subtracts the interim values from the four pixels to find how much they "change" in the box.  In other words, the derivative function in GLSL works by discrete per-pixel sampling.&lt;br /&gt;&lt;br /&gt;(BTW this is why when you screw up code that needs to treat derivatives carefully, often you'll get 2x2 pixel artifacts.)&lt;br /&gt;&lt;br /&gt;These derivatives allow the graphics card to select a LOD.  At the site of a texture fetch, the card can do a derivative operation on the input texture coordinates and see how fast they change per pixel.  The faster they change, the lower the effective texture res and the lower LOD mip-map we need.  
That is how the card "knows" to use the lower mip-maps even when you use expressions for your texture coordinates - the derivative is taken on the entire expression.&lt;br /&gt;&lt;br /&gt;But...what happens when you have a discontinuity in your UV map?  Take a simple case like "fract".  If you "fract" a wrapping texture, you will quite possibly see an artifact at the edges.  This is because, right at the edge, the rate of change of the UV map is &lt;i&gt;much&lt;/i&gt; higher than before, as it "jumps" from one edge of the texture to the other.  High rate of change = low LOD - the graphics card goes and selects the lowest level LOD it has!&lt;br /&gt;&lt;br /&gt;(If you don't know what's in your lowest mip, you might not know where the color was coming from.)&lt;br /&gt;&lt;br /&gt;The solution is &lt;a href="http://www.opengl.org/registry/specs/ARB/shader_texture_lod.txt"&gt;here&lt;/a&gt;: texture2DGradARB.  This function lets you separately specify the texture coordinates and the derivatives.  Here's a simple example.  Imagine you have this:&lt;br /&gt;&lt;blockquote&gt;vec2 uv_swizzled = fract(uv);&lt;br /&gt;vec4 rgba = texture2D(my_tex, uv_swizzled);&lt;br /&gt;&lt;/blockquote&gt;That example will create a  few pixels of low-mipmap texture at the discontinuity (where the texture goes from 1 back to 0).  To use texture2DGradARB, you do this:&lt;br /&gt;&lt;blockquote&gt;vec2 uv_swizzled = fract(uv);&lt;br /&gt;vec4 rgba = texture2DGradARB(my_tex,uv_swizzled,dFdx(uv),dFdy(uv));&lt;br /&gt;&lt;/blockquote&gt;By using the original (continuous) texture coordinates for the derivative, but the modified ones for the fetch, you can have discontinuous fetches with no LOD artifacts.&lt;br /&gt;&lt;br /&gt;NVidia and ATI cards don't respond the same way to discontinuous coordinates, but both will produce artifacts, and both are right to do so.&lt;br /&gt;&lt;br /&gt;One last note.  
From the shader texture LOD extension:&lt;br /&gt;&lt;pre&gt;   Mipmap texture fetches and anisotropic texture fetches&lt;br /&gt;  require an implicit derivatives to calculate rho, lambda&lt;br /&gt;  and/or the line of anisotropy.  These implicit derivatives&lt;br /&gt;  will be undefined for texture fetches occuring inside&lt;br /&gt;  non-uniform control flow or for vertex shader texture&lt;br /&gt;  fetches, resulting in undefined texels.&lt;br /&gt;&lt;/pre&gt;I can tell you from experience that a number of my artifacts have come from conditional code flow.  I believe that by non-uniform control flow they mean the case where the shader branches are not all taken the same way for a 2x2 block, but I am not sure.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-4678969904180308389?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/4678969904180308389/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/how-to-change-your-uv-map-on-fly.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/4678969904180308389'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/4678969904180308389'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/how-to-change-your-uv-map-on-fly.html' title='How To Change Your UV Map on the Fly'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-2659445438176624982</id><published>2010-02-10T09:19:00.003-05:00</published><updated>2010-02-10T09:36:40.496-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='GLSL'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>Running Out of Derivative Res</title><content type='html'>In a previous post I &lt;a href="http://hacksoflife.blogspot.com/2009/11/per-pixel-tangent-space-normal-mapping.html"&gt;went over the math&lt;/a&gt; behind generating the coordinate system for normal mapping in a pixel shader, which allows you to use tangent space bump mapping without encoding coordinate axes on your vertex mesh.  (In X-Plane we do this so that we can allow authors to add bump maps to "unmodified" meshes.)&lt;br /&gt;&lt;br /&gt;One of the problems with writing shaders is that it can be write-once, debug everywhere.  As it turns out, this technique has a problem that I can repro on a GF8800 but not HD4870.  On the 8800, I run out of precision in my derivative (dFdx and dFdy) functions.&lt;br /&gt;&lt;br /&gt;In the scene in question, the UV map is generated in the vertex shader via projection off the world-space input vertices and the input mesh is big - 300 x 300 km in fact.  (It is of course the base terrain.)&lt;br /&gt;&lt;br /&gt;This means that the UV coordinates are pretty big too, particularly for highly scaled up textures.  And that means that the effective resolution limit of the &lt;i&gt;texture coordinates&lt;/i&gt; may be larger than one pixel.&lt;br /&gt;&lt;br /&gt;When this happens, the result is a derivative that will be inconsistent across pixels, and the basis for the bump map will be corrupted on a per-pixel level.&lt;br /&gt;&lt;br /&gt;Work-arounds?  
I can think of two:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Modify the texture coordinate generation system to produce higher precision UV maps.&lt;/li&gt;&lt;li&gt;Modify the shader to generate basis vectors from the projection parameters (rather than by "sampling" via the UV map) in the texture coordinate generation case.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-2659445438176624982?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/2659445438176624982/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/running-out-of-derivative-res.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/2659445438176624982'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/2659445438176624982'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/running-out-of-derivative-res.html' title='Running Out of Derivative Res'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-2596000979956065715</id><published>2010-02-08T14:18:00.002-05:00</published><updated>2010-02-08T14:27:01.904-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Linux'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>glXGetProcAddressARB Syntax</title><content 
type='html'>I was slightly astounded to read that glXGetProcAddressARB is declared like this:&lt;br /&gt;&lt;blockquote&gt;void (*glXGetProcAddressARB(const GLubyte *procName))();&lt;br /&gt;&lt;/blockquote&gt;Wha?  Well, fortunately when you read the spec you'll note that they're just being clever...that's very strange C for&lt;br /&gt;&lt;blockquote&gt;typedef void (*GLfunction)();&lt;br /&gt;extern GLfunction glXGetProcAddressARB(const GLubyte *procName);&lt;br /&gt;&lt;/blockquote&gt;In other words, unlike all other operating systems, which define the returned type of a proc query as a void *, GLX typedefs it as a pointer to a function taking no arguments and returning nothing.&lt;br /&gt;&lt;br /&gt;Why this is useful is beyond me, but if you are like us and call one of wgl, AGL, or GLX, you may have to cast the return of glXGetProcAddressARB to (void *) to make it play nice with the other operating systems.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-2596000979956065715?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/2596000979956065715/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/glxgetprocaddressarb-syntax.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/2596000979956065715'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/2596000979956065715'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/glxgetprocaddressarb-syntax.html' title='glXGetProcAddressARB Syntax'/><author><name>Benjamin 
Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-5995813391158092442</id><published>2010-02-04T06:18:00.000-05:00</published><updated>2010-02-04T06:18:00.744-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>How To Scroll the OpenGL World</title><content type='html'>So...despite my best efforts to post ridiculous and stupid ideas to this blog, there appear to still be people reading and commenting on it.  Chris and I don't really understand this at all, but what the heck: this post is aimed at soliciting feedback.  I'm wondering if I've missed a very basic case in a very basic problem.&lt;br /&gt;&lt;br /&gt;The problem is the scrolling world.  If you have a 3-d "world" in your game implemented in OpenGL, you're up against the limited (32-bit at best) coordinate precision of the GL.  As your user migrates around the world and gets farther away from the origin, you start to lose bits of precision.  At some point, you have to reset the coordinate system.&lt;br /&gt;&lt;br /&gt;I see three fundamental ways to address this problem:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;Stop the world and transform it.  This is what X-Plane does now, and it's not very good.  We bring multi-core processing into play, but what we're really bottlenecked by is the PCIe bus - many of our meshes are on the GPU and have to come back to the CPU for transformation.&lt;/p&gt;&lt;p&gt;(Transform feedback?  A cool idea, but in my experience GL implementations often respond quite badly to having to "page out" meshes that are modified on card.)&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Double-buffer.  Make a second copy of the world and transform it, then swap.  
This lets us change coordinate systems quickly (just the time of a swap) but requires enough RAM to have two copies of every scene-graph mesh in memory at the same time.  We rejected this approach because we often don't have that kind of memory around.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;Use local coordinate systems and transform to them.  Under this approach, each small piece of the world is in its own local coordinate system, and only the relationship between these "local" coordinate systems and "the" global coordinate system is changed.&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;This third approach strikes me as the most promising one, but it also strikes me as difficult from a mesh-cracking standpoint.  I don't see any way to guarantee that two triangles emitted under different matrix transforms will have the same final device coordinates, and if they don't, there can be mesh artifacts.&lt;br /&gt;&lt;br /&gt;So that's my question: is there a way to connect two meshes under different coordinate transforms without cracking?  Is there a limited set of matrix transforms that will, either in theory or practice produce acceptable results?  Do game engines just hack around this by using clever authoring (e.g. 
overlap the tiles slightly and cheat on the Z buffer)?&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-5995813391158092442?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/5995813391158092442/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/how-to-scroll-opengl-world.html#comment-form' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5995813391158092442'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/5995813391158092442'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/how-to-scroll-opengl-world.html' title='How To Scroll the OpenGL World'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-1468862816118782054</id><published>2010-02-03T18:07:00.003-05:00</published><updated>2010-02-03T18:13:10.123-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='c++'/><category scheme='http://www.blogger.com/atom/ns#' term='Quotes'/><category scheme='http://www.blogger.com/atom/ns#' term='STL'/><title type='text'>The STL Is Not An Abstraction</title><content type='html'>I came to a realization the other day, having been burned by the STL for approximately the 100,000th time.  
Okay here goes that quotable crap again:&lt;br /&gt;&lt;blockquote&gt;The STL is not an abstraction; it is a shortcut.&lt;/blockquote&gt;In computer programming, an abstraction is something that hides the details.  Abstractions let us get stuff done, and most of the time they &lt;a href="http://www.joelonsoftware.com/articles/LeakyAbstractions.html"&gt;leak&lt;/a&gt;.  Is the STL the leakiest abstraction in the universe?&lt;br /&gt;&lt;br /&gt;No.  It's not an abstraction at all.  Abstractions hide implementation from you - the STL simply &lt;i&gt;provides&lt;/i&gt; implementation.&lt;br /&gt;&lt;br /&gt;An indication that the STL is an abstraction would be that you could change the implementation of an STL algorithm or container and not notice.  Does the STL meet that criterion?  I don't think so, at least not in any sane way.&lt;br /&gt;&lt;br /&gt;With the STL, you need to know &lt;i&gt;all&lt;/i&gt; of the fine print for any algorithm or class you use.  Picking the type means picking an algorithm or data structure for its strengths and weaknesses.  For example, if you pick vector, you are picking the following:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;A simple, compact representation.&lt;/li&gt;&lt;li&gt;Blazingly fast random access and iteration.&lt;/li&gt;&lt;li&gt;The copy constructor of your data is going to be called a gajillion times.&lt;/li&gt;&lt;li&gt;Mutating the size of the vector is going to hose outstanding iterators.&lt;/li&gt;&lt;li&gt;Insertion and deletion anywhere but the far end cost a fortune.&lt;/li&gt;&lt;/ul&gt;That's how vectors roll.  A container abstraction might hide these things; picking vector &lt;i&gt;prescribes&lt;/i&gt; what will happen, pretty exactly.&lt;br /&gt;&lt;br /&gt;And that's okay; typing vector&amp;lt;int&amp;gt; is still faster and less error prone than typing int * and remembering not to screw up the dynamic memory allocation.  
But let's recognize what the STL is: a way to make certain known containers and algorithms much faster to put into your code - not a way to write code without knowing what your algorithms and containers do!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-1468862816118782054?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/1468862816118782054/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/stl-is-not-abstraction.html#comment-form' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/1468862816118782054'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/1468862816118782054'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/stl-is-not-abstraction.html' title='The STL Is Not An Abstraction'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-4150744820319714975</id><published>2010-02-01T21:08:00.003-05:00</published><updated>2010-02-01T21:50:30.403-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Rants'/><title type='text'>Moore's Law and Openness</title><content type='html'>If you look back at Windows and how the west was won, you'll see a story of network effects and compatibility: an unbroken chain of being able to run old apps unmodified from DOS to Windows, and an architecture (x86) 
that we're still stuck with today. If there are two lessons to take away, it might be:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Software takes forever to die - it's really hard to throw it out and start over again.&lt;/li&gt;&lt;li&gt;Network effects are very strong - once all the apps are on Windows, everyone wants to run Windows.  Once everyone runs Windows, we want to write apps for Windows.&lt;/li&gt;&lt;/ul&gt;I realize that this blog article might look really, really stupid in a year or two (and in that case, all hail Google, our new overlords).  But...the strong networking effects in the embedded games space all point towards the iphone.  App developers know that if you want to be on one platform to make money, you have to look at the iphone first, even if you hate Objective C.  And if you want to run apps, the iPhone is in its own category.  (Just spend a car ride with an iPhone owner and you'll see..."Look, you can flick a ball of paper".  I can't knock it - it's a &lt;i&gt;fun&lt;/i&gt; app!)&lt;br /&gt;&lt;br /&gt;What's weird here is that the iPhone is pretty much invented out of whole cloth.  It doesn't run software from any other platform, it builds its UI off of Objective C and Cocoa (which, to the non-Kool-Aid drinking half of the Apple third party development community looks like a new way to force us to use what we've been ignoring for years) and Apple has had the device locked up from day 1.  This couldn't be more different than how Windows gained domination.  So how did we get here?&lt;br /&gt;&lt;br /&gt;Clearly having a beautiful device way before everyone else makes a huge difference.  But I want to focus on another idea: is it possible that technology "productivity dividends" have fundamentally changed the calculus of building a new platform?&lt;br /&gt;&lt;br /&gt;Development of applications for the original Macintosh was, by modern standards, brutal.  You had 128K for the OS and your app, and it was a tight squeeze.  
Every line of code was performance critical and size critical.  Those first GUI-based apps were written by some seriously brilliant programmers who had to sweat bullets.&lt;br /&gt;&lt;br /&gt;Fortunately for us working programmers, computers are now much much faster and bigger.  Instead of writing apps that are millions of times faster (which no one would care about - at some point, the window appears to open instantly and any speed improvement is moot) we write at a higher level of abstraction, which means we write apps more quickly.  To draw a supply and demand analogy, apps for the iPhone (or any computer now) are less expensive in man hours because we have better tools that trade hardware horsepower for ease of development.&lt;br /&gt;&lt;br /&gt;So that might partly explain why Apple now has 140,000 apps or so on their phone.  It's not &lt;i&gt;that&lt;/i&gt; hard to write them.  But what about this business where Apple hand-picks apps and rejects the ones they don't like?  My first reaction as an iPhone app developer was "hrm....it sure looks like a real computer, but man is it &lt;i&gt;locked down&lt;/i&gt;."  It certainly wasn't what I was used to.&lt;br /&gt;&lt;br /&gt;The iPhone is a surprising device to develop for, because as an app developer, you aren't given the tools to hose the machine.  As a Windows developer you might be grumpy that, after decades, Microsoft has finally said that you can't dump files randomly in the system folder without user permission, but the iPhone takes things more seriously.  It's somebody's phone, dammit, and your app isn't getting outside of its sandbox, let alone into the OS.&lt;br /&gt;&lt;br /&gt;I see the fact that the iPhone has successfully developed a third party market despite being locked down as an indication that user demands may be changing.  In the old world, where apps were rare and expensive to write, what we wanted was: more software.  
Perhaps in the new world, where writing apps isn't so hard, what users want is an experience that focuses on quality rather than quantity of apps.&lt;br /&gt;&lt;br /&gt;(Or to put it another way: if you would agree to audit every single piece of software that a user might put on their Windows computer and guarantee that none of it was going to wreck that computer, you'd have a service you could sell.  The iPhone comes with that out of the box.)&lt;br /&gt;&lt;br /&gt;Of course, I could be missing the point entirely; the iPhone cuts distributors out of the loop, with sales going only to store and studio - perhaps that's enough to launch 140,000 apps.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-4150744820319714975?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/4150744820319714975/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/moores-law-and-openness.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/4150744820319714975'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/4150744820319714975'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/02/moores-law-and-openness.html' title='Moore&apos;s Law and Openness'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-7787596486181748072</id><published>2010-01-31T14:01:00.002-05:00</published><updated>2010-01-31T14:51:01.194-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><category scheme='http://www.blogger.com/atom/ns#' term='Performance'/><title type='text'>To Strip or Not To Strip</title><content type='html'>In this post I will try to explain why a performance-focused OpenGL application like X-Plane does not use triangle strips.  Since triangle strips were the best way to draw meshes a few years back, a new user searching for information might be confronted by a cacophony of tutorials advocating triangle strips and game developers saying "indexed triangles are better" without explaining why.  Here's the math.&lt;br /&gt;&lt;br /&gt;Please note: this article applies to OpenGL desktop applications, typically targeting NVidia and ATI GPUs.  In the mobile/embedded space, it's a very different world, and certain GPUs (cough, cough, PowerVR, cough cough) have some fine print attached to them that might make you reconsider.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Why Triangle Strips Are Good&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;If you are drawing a bunch of connected triangles, the logic in favor of triangle strips is very simple: the number of vertices in the strip will be almost 66% fewer than the number you'd have if you simply made triangles.  The longer the strip, the closer to that savings you get.  Since geometry throughput is generally limited by total vertex count, this is a big win.&lt;br /&gt;&lt;br /&gt;Ten years ago, that's all you needed to know.  Of course, making triangle strips is not so easy - some meshes simply won't form strips.  The general idea was to make as many strips as you can, and draw the rest of your triangles as "free triangles" (e.g. 
GL_TRIANGLES, where each triangle is 3 vertices, and no vertices are shared).&lt;br /&gt;&lt;br /&gt;(By the way, to see how to use the tri_stripper library to create triangle strips, look at the function DSFOptimizePrimitives in the &lt;a href="http://dev.x-plane.com/cgit/cgit.cgi/xptools.git/tree/src/DSF/DSFPointPool.cpp"&gt;X-Plane scenery tools code&lt;/a&gt;. Why DSFLib does this will have to be explained in another post, but suffice it to say, there is no hypocrisy here: X-Plane disassembles the triangle strips in the DSF into "free triangles" on load.)&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Indexing Is Better&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;In an indexed mesh, each vertex is stored only once, and the triangles are formed from a set of indices.  (In OpenGL this is done by moving from glDrawArrays to glDrawElements.)  With indexing, you pay a little extra (2 or 4 bytes) every time a vertex is referenced, but you don't ever have to repeat the geometry of a vertex.&lt;br /&gt;&lt;br /&gt;When is it worth it to index?  It depends on the size of your indices and vertices, but it is almost always a win.  For example, in an X-Plane object our vertices are 32 bytes (XYZ, normal, one UV map, all floating point) and our indices are 4 bytes (unsigned integer indices).  Thus a vertex is 8x more expensive than an index.  So if sharing lets us eliminate more than 1/8th of the vertices, we will have a win.&lt;br /&gt;&lt;br /&gt;Consider a simple 2-d grid: even with triangle strips, every strip except the ones on the border shares an edge with the strip next to it.  Thus if we use indexing, our 2-d mesh is going to have a savings that nearly approaches 2x for the geometry!  That is way more than enough to pay for the cost of the indices.&lt;br /&gt;&lt;br /&gt;So the moral of the story is: any time your geometry has shared vertices, use indexing.  Note that this won't always happen.  If you have a mesh of GL_POINTS, you will have no sharing, so indexing is a waste.  
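&lt;br /&gt;&lt;br /&gt;(A rough sketch of the arithmetic behind the grid case, for the curious. It uses the 32-byte vertices and 4-byte indices quoted above; the function names and the n-by-n grid of quads are made up for illustration and are not X-Plane code.)

```c
/* Sizes quoted in the post: 32-byte vertices, 4-byte indices. */
enum { VERTEX_BYTES = 32, INDEX_BYTES = 4 };

/* Free triangles (GL_TRIANGLES, no indexing): every triangle
   carries three complete vertices, shared or not. */
unsigned long free_triangle_bytes(unsigned long n)
{
    unsigned long triangles = 2 * n * n;   /* two triangles per quad */
    return triangles * 3 * VERTEX_BYTES;
}

/* Indexed triangles: each grid point is stored exactly once,
   plus three indices per triangle. */
unsigned long indexed_triangle_bytes(unsigned long n)
{
    unsigned long vertices = (n + 1) * (n + 1);
    unsigned long triangles = 2 * n * n;
    return vertices * VERTEX_BYTES + triangles * 3 * INDEX_BYTES;
}
```

For a 100 x 100 grid this gives 1,920,000 bytes of free triangles against 566,432 bytes for the indexed version, and most of the indexed total is the index list rather than the vertices.&lt;br /&gt;&lt;br /&gt;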
In X-Plane, our "trees" are all individual quads, no sharing, so we turn off indexing because we know the indexing will do no good.&lt;br /&gt;&lt;br /&gt;But for most "meshed" art assets (e.g. anything someone built for you in a 3-d modeler) it is extremely likely that indexing will cut down the total amount of data you have to send to the GPU, and that is a good thing.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Triangle Strips Aren't That Cool  When We Index&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Now in the old school world, a triangle strip cut the amount of geometry down by almost 3x.  Awesome!  But in the indexed world, a triangle strip only cuts down the size of the &lt;i&gt;index list&lt;/i&gt; by 3x.  That is...not nearly as impressive.  In fact, in X-Plane's case it is only 1/8th as impressive as it would have been for non-indexed geometry.&lt;br /&gt;&lt;br /&gt;The take-away thing to observe: once we start indexing (which really makes geometry storage efficient) triangle strips aren't nearly as important as they used to be.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Restarting a Primitive Hurts&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;So far we've talked about ideal cases, where your triangle strips are very long, so we really approach a 3x savings.  Here's the hitch: in real life triangle strips might be very short.&lt;br /&gt;&lt;br /&gt;The problem with triangle strips is that we have to tell the card where the triangle strips begin and end, and that can get expensive.  You might have to issue a separate glDrawElements call for each strip.&lt;br /&gt;&lt;br /&gt;You don't want to make additional CPU calls into the GL to minimize the size of a buffer (the index buffer) that is already held in VRAM.  CPU calls are much slower.  
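&lt;br /&gt;&lt;br /&gt;(To put hypothetical numbers on that trade-off, here is a small sketch. The mesh shape, the function names and the 4-byte index size are assumptions for illustration; it assumes no multi-draw or primitive restart, so each strip costs one draw call.)

```c
/* Assumed index size: 4 bytes, as in the X-Plane example. */
enum { BYTES_PER_INDEX = 4 };

/* Indexed strips: a strip of t triangles needs t + 2 indices,
   and each strip is submitted with its own draw call. */
unsigned long strip_index_bytes(unsigned long strips,
                                unsigned long tris_per_strip)
{
    return strips * (tris_per_strip + 2) * BYTES_PER_INDEX;
}

/* Indexed free triangles: 3 indices per triangle, but the whole
   mesh goes to the GL in a single draw call. */
unsigned long triangle_index_bytes(unsigned long strips,
                                   unsigned long tris_per_strip)
{
    return strips * tris_per_strip * 3 * BYTES_PER_INDEX;
}
```

With 1000 short strips of 4 triangles each, the strips need 24,000 bytes of indices and 1000 draw calls; the same mesh as indexed triangles needs 48,000 bytes of indices and one draw call. That is 999 extra trips into the driver to save about 24 KB of VRAM.&lt;br /&gt;&lt;br /&gt;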
And this is why X-Plane doesn't use strips internally: it's faster to be able to make only one draw call per mesh, even if it means a slightly bigger element list.&lt;br /&gt;&lt;br /&gt;Now if you are a savvy OpenGL developer you are probably screaming one of two things:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;What about glMultiDrawElements?  I point you &lt;a href="http://lists.apple.com/archives/mac-opengl/2005/Dec/msg00039.html"&gt;here&lt;/a&gt; and &lt;a href="http://www.opengl.org/registry/specs/NV/primitive_restart.txt"&gt;here&lt;/a&gt;.  Basically both Apple and NVidia are suggesting that the multi-draw case may not be as ball-bustingly fast as it could be.  There is always a risk that the driver decomposes your carefully consolidated strips into individual draw calls, and at that point you lose.&lt;/li&gt;&lt;li&gt;What about primitive restart?  Well, it's NVidia-only, so if you use it, you need to special-case your basic meshing code to handle its not being there.  And even if it is there, you pay with an extra index per restart.  If you have really good strips, this might be a win, but when the strips get small, you're starting to eat away at the benefit of shrinking down your indices in the first place.  (The worst case is a triangle soup with no sharing, so you get no benefit from tri strips and you have to put a "restart" index into every 4th slot.)&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;And this brings me to one more concern: even if you do have some nice triangle strips in your mesh, you may have free triangles too, and in that case you're going to have to make two separate batch calls (GL_TRIANGLE_STRIP, GL_TRIANGLES) for the two "halves" of the mesh.  
So even if you are getting a triangle strip win, you're probably going to double the number of &lt;i&gt;real&lt;/i&gt; draw calls (even with multi-draw) just to shrink an index list down.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Index Triangles&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Thus the X-Plane solution: any time we have a mesh, we use indexed triangles and we go home happy.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;We always draw every mesh in only one draw call.&lt;/li&gt;&lt;li&gt;We share vertices as much as possible.&lt;/li&gt;&lt;li&gt;We are in no way dependent on the driver handling multi-draw or having a restart extension.&lt;/li&gt;&lt;li&gt;We run at full speed even if the actual mesh doesn't turn to strips very well.&lt;/li&gt;&lt;li&gt;The code handles only one case.&lt;/li&gt;&lt;/ul&gt;As a final note, this post doesn't discuss cache coherency - that is, if you are going to present the driver with a "triangle soup", what is the best order?  That will have to be another post, but for now understand that the point of this post is "indexed triangles are better than strips" - I am not saying "order doesn't matter" - cache coherency and vertex order can matter, no matter how you get the vertices into the GPU.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-7787596486181748072?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/7787596486181748072/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/01/to-strip-or-not-to-strip.html#comment-form' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7787596486181748072'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7787596486181748072'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/01/to-strip-or-not-to-strip.html' title='To Strip or Not To Strip'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-8226211902973431355</id><published>2010-01-29T11:37:00.002-05:00</published><updated>2010-01-29T11:49:59.986-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Debugging'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><category scheme='http://www.blogger.com/atom/ns#' term='Rants'/><title type='text'>The Devil Is In the Details</title><content type='html'>I seem to have become horribly addicted to &lt;a href="http://stackoverflow.com/"&gt;Stack Overflow&lt;/a&gt;.  It makes sense, but I just feel a compulsion to answer other people's questions about OpenGL.&lt;br /&gt;&lt;br /&gt;But there is one kind of question that drives me a little bit nutty...it goes like this:&lt;br /&gt;&lt;blockquote&gt;I am new to OpenGL and I hope someone can help me.  I am drawing a series of interlocking mobeus rings using glu nurb tessellateors, GL_TEX_ENV_COMBINE, a custom separate alpha blending mode, the stencil buffer, and polygon offset.&lt;br /&gt;&lt;br /&gt;For some reason one of my polygons are clipped.  If I change the combine mode to add, the purple ones move to the left.  
If I change the polygon offset, the problem persists.&lt;br /&gt;&lt;br /&gt;Any ideas?&lt;/blockquote&gt;My fellow OpenGL programmers: Stack Overflow is not a debugging service.&lt;br /&gt;&lt;br /&gt;Stack Overflow is a great idea, and the site execution is really pretty good: automatic syntax formatting appropriate to code, tagging, and search that works pretty well. It is good for answering questions.&lt;br /&gt;&lt;br /&gt;But a post like the above: it's not a question, it's a cry for help.  (The answer, technically, is "yes", but I don't want to post that and get bad karma.)  There are about a million things that could be going wrong from the fundamental design to the nuts and bolts.&lt;br /&gt;&lt;br /&gt;In my experience, OpenGL bugs fall into three categories:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;p&gt;There is a one-off stupid mistake deep in the implementation that causes all hell to break loose.  Fixing the bug requires the usual techniques (&lt;a href="http://hacksoflife.blogspot.com/2010/01/debugging-glsl.html"&gt;divide and conquer and printf&lt;/a&gt;) until the bug is found and fixed.  Stack Overflow is not the right tool - any programmer who is going to fix this needs to be able to modify and re-run the app repeatedly, and no one is going to do this for you for free anyway.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The overall algorithm design is wrong because of the design limits of the GL.  Another programmer could at least &lt;i&gt;tell&lt;/i&gt; you that you have this problem, but only if you know enough to ask the right questions.  And if you know enough to ask, heck, you probably wouldn't have designed the code this way in the first place.&lt;/p&gt;&lt;/li&gt;&lt;li&gt;&lt;p&gt;The GL implementation has a known bug.  This is the one case where Stack Overflow can help, but the above question is not that.  The programmer needs to have cut the problem all the way down to the one mysterious behavior (e.g. 
my color is showing up in one of my vertex attributes but the GL spec says &lt;a href="http://hacksoflife.blogspot.com/2010/01/ive-got-blues.html"&gt;this should not happen&lt;/a&gt;).  In this case, at least having confirmation from other programmers that the bug is really in library code helps provide closure to the investigation.&lt;/p&gt;&lt;/li&gt;&lt;/ol&gt;My rant here is directed against case 1.  If you need to post a long and detailed description of your code (as opposed to a question), you're not really asking a question, you're asking for someone to do your job for you.&lt;br /&gt;&lt;br /&gt;Enough blogging, I'm going to go back to being grumpy now.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-8226211902973431355?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/8226211902973431355/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/01/devil-is-in-details.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8226211902973431355'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8226211902973431355'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/01/devil-is-in-details.html' title='The Devil Is In the Details'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' 
src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-7278189988222989652</id><published>2010-01-28T19:45:00.003-05:00</published><updated>2010-01-28T19:59:03.172-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='NVidia'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>I've Got the Blues</title><content type='html'>I have learned many things today - some of which you may already know.  Did you know that in German, to be "blue" means to be drunk, not sad?  Maybe I will have a blue Christmas next year!&lt;br /&gt;&lt;br /&gt;I learned this while working with alpilotx on a quirky bug: experimental instancing code was causing the instanced geometry to look completely goofy and turning the rest of the scene pretty much completely blue.  The bug appeared only on NVidia hardware on Linux.&lt;br /&gt;&lt;br /&gt;Well if you read the &lt;a href="http://developer.download.nvidia.com/opengl/glsl/glsl_release_notes.pdf"&gt;fine print&lt;/a&gt; closely, you'll find this:&lt;br /&gt;&lt;blockquote&gt;NVIDIA’s GLSL implementation therefore does not allow built-in vertex attributes to&lt;br /&gt;collide with a generic vertex attributes that is assigned to a particular vertex attribute&lt;br /&gt;index with glBindAttribLocation. For example, you should not use gl_Normal (a&lt;br /&gt;built-in vertex attribute) and also use glBindAttribLocation to bind a generic vertex&lt;br /&gt;attribute named “whatever” to vertex attribute index 2 because gl_Normal aliases to&lt;br /&gt;index 2.&lt;br /&gt;&lt;/blockquote&gt;This is really too bad, as the GL 2.1 specification says:&lt;br /&gt;&lt;blockquote&gt;There is no aliasing among generic attributes and conventional attributes. 
In&lt;br /&gt;other words, an application can set all MAX_VERTEX_ATTRIBS generic attributes&lt;br /&gt;and all conventional attributes without fear of one particular attribute overwriting&lt;br /&gt;the value of another attribute. &lt;/blockquote&gt;I can report, with a full head of steam and outrage, that the current NVidia drivers on Linux definitely work the way NVidia says they do, and not the way the spec would like them to.  Documenting what their code does...the nerve of it!  Those NVidia driver writers!&lt;br /&gt;&lt;br /&gt;What?  You already knew this? Ha ha, so did I, just kidding, I was just quizzing you...&lt;br /&gt;&lt;br /&gt;I don't think this is really news at all - I think I'm just really late to the party.  In particular, X-Plane 8 and 9 run all of their shaders entirely using the built-in attributes to pass per-vertex information; sometimes that information is quite heavily bastardized to make it happen.&lt;br /&gt;&lt;br /&gt;I'm sure there are reasons why this is evil, but I can tell you why we did it: it allows us to have a unified code path for scene graph, mesh, and buffer management.  Only our shader setup code is actually sensitive to what the actual hardware is capable of doing - the rest runs on anything back to OpenGL 1.2.1.&lt;br /&gt;&lt;br /&gt;(This is actually not 100% true. In some cases we will attach additional attributes to our vertices only on a machine with GLSL - this is a simple optimization during mesh build-up to save time and space for machines that will never use the extra attributes anyway.  
An example of this is the basis vectors for billboarding that are attached to trees: no GLSL means no billboarding the trees, so we drop the extra basis vectors.)&lt;br /&gt;&lt;br /&gt;The moral of the story: let the linker pick your attribute indices.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-7278189988222989652?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/7278189988222989652/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/01/ive-got-blues.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7278189988222989652'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/7278189988222989652'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/01/ive-got-blues.html' title='I&apos;ve Got the Blues'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-8618651530008571907</id><published>2010-01-28T13:35:00.002-05:00</published><updated>2010-01-28T13:41:23.991-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='c++'/><title type='text'>Templating Functions</title><content type='html'>(This is a rehash of an answer I posted on Stack Overflow, after reading the previous posts and experimenting...probably bad form to report here, but I want all my C++ drek in one place.)&lt;br 
/&gt;&lt;br /&gt;Template parameters can be either parameterized by type (typename T) or by value (int X).&lt;br /&gt;&lt;br /&gt;The "traditional" C++ way of templating a piece of code is to use a functor - that is, the code is in an object, and the object thus gives the code a unique type.&lt;br /&gt;&lt;br /&gt;When working with traditional functions, this technique doesn't work well, because a change in type doesn't indicate a specific function - rather it specifies only the signature of many possible functions. So:&lt;br /&gt;&lt;blockquote&gt;&lt;code&gt;template&amp;lt;typename OP&amp;gt;&lt;br /&gt;int do_op(int a, int b, OP op)&lt;br /&gt;{&lt;br /&gt;return op(a,b);&lt;br /&gt;}&lt;br /&gt;int add(int a, int b) { return a + b; }&lt;br /&gt;...&lt;br /&gt;int c = do_op(4,5,add);&lt;/code&gt;&lt;/blockquote&gt;isn't equivalent to the functor case. In this example, do_op is instantiated for all function pointers whose signature is int X (int, int). The compiler would have to be pretty aggressive to fully inline this case. (I wouldn't rule it out though, as compiler optimization has gotten pretty advanced.)&lt;br /&gt;&lt;br /&gt;One way to tell that this code doesn't quite do what we want is:&lt;br /&gt;&lt;blockquote&gt;&lt;code&gt;int (* func_ptr)(int, int) = add;&lt;br /&gt;int c = do_op(4,5,func_ptr);&lt;/code&gt;&lt;/blockquote&gt;is still legal, and clearly this is not getting inlined. To get full inlining, we need to template by value, so the function is fully available in the template.&lt;br /&gt;&lt;blockquote&gt;&lt;code&gt;typedef int(*binary_int_op)(int, int); // signature for all params&lt;br /&gt;template&amp;lt;binary_int_op op&amp;gt; int do_op(int a, int b) { return op(a,b); }&lt;br /&gt;int add(int a, int b) { return a + b; }&lt;br /&gt; ...&lt;br /&gt;int c = do_op&amp;lt;add&amp;gt;(4,5);&lt;/code&gt;&lt;/blockquote&gt;In this case, each instantiated version of do_op is instantiated with a specific function already available. Thus we expect the code for do_op to look a lot like "return a + b". 
(Lisp programmers, stop your smirking!)&lt;br /&gt;&lt;br /&gt;We can also confirm that this is closer to what we want because this:&lt;br /&gt;&lt;blockquote&gt;&lt;code&gt;int (* func_ptr)(int,int) = add;&lt;br /&gt;int c = do_op&amp;lt;func_ptr&amp;gt;(4,5);&lt;/code&gt;&lt;/blockquote&gt;will fail to compile. GCC says: "error: 'func_ptr' cannot appear in a constant-expression". In other words, I can't fully expand do_op because you haven't given me enough info at compile time to know what our op is.&lt;br /&gt;&lt;br /&gt;So if the second example is really fully inlining our op, and the first is not, what good is the template? What is it doing? The answer is: type coercion. This riff on the first example will work:&lt;br /&gt;&lt;blockquote&gt;&lt;code&gt;template&amp;lt;typename OP&amp;gt;&lt;br /&gt;int do_op(int a, int b, OP op) { return op(a,b); }&lt;br /&gt;float fadd(float a, float b) { return a+b; }&lt;br /&gt;...&lt;br /&gt;int c = do_op(4,5,fadd);&lt;/code&gt;&lt;/blockquote&gt;That example will work! (I am not suggesting it is good C++ but...) What has happened is do_op has been templated around the signatures of the various functions, and each separate instantiation will write different type coercion code. 
So the instantiated code for do_op with fadd looks something like:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;convert a and b from int to float.&lt;/li&gt;&lt;li&gt;call the function ptr op with float a and float b.&lt;/li&gt;&lt;li&gt;convert the result back to int and return it.&lt;/li&gt;&lt;/ol&gt;By comparison, our by-value case requires an exact match on the function arguments.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-8618651530008571907?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/8618651530008571907/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/01/templating-functions.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8618651530008571907'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8618651530008571907'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/01/templating-functions.html' title='Templating Functions'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-515672290275093581</id><published>2010-01-27T23:09:00.000-05:00</published><updated>2010-01-27T23:09:00.028-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Debugging'/><category scheme='http://www.blogger.com/atom/ns#' term='OpenGL'/><title type='text'>Debugging GLSL</title><content type='html'>From 
a &lt;a href="http://hacksoflife.blogspot.com/2006/01/debugging-opengl.html"&gt;past post&lt;/a&gt;:&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;blockquote&gt;There are only two debugging techniques in the universe:&lt;p&gt;&lt;/p&gt; &lt;ol&gt;&lt;li&gt;printf.&lt;/li&gt;&lt;li&gt;/* */&lt;/li&gt;&lt;/ol&gt;&lt;/blockquote&gt;Is that true when writing GLSL shaders?  Yep.  Commenting out things is natively available.   What about printf?  The GLSL equivalent of printf is&lt;br /&gt;&lt;blockquote&gt;&lt;code&gt;gl_FragColor.rgba = vec4(stuff_i_want_to_see,...);&lt;/code&gt;&lt;/blockquote&gt;That is, you simply output an intermediate product to the final color, run your shader, then view something else.  This is how I debug some of the more complex shaders: I view each product in series to confirm that my intermediate values aren't broken.  Since the sim is running at 30 fps, I can move the camera and confirm that the values stay sane through a range of values.&lt;br /&gt;&lt;br /&gt;The numeric output is often not in a visible range - to get around that I often use a mix of abs, fract (to see just the lowest bits), scaling, and normalize() to sanitize the output.&lt;br /&gt;&lt;br /&gt;One app feature is critical: make sure you can reload your shaders in a heart-beat.  In X-Plane we have a hidden menu command to do this.  This way, you can move your printf, recompile the shaders, and see the change.&lt;br /&gt;&lt;br /&gt;A visual debugger is a useful tool for debugging C/C++ because you don't have to commit to what you will view before compiling - you can just print any intermediate product from the debugger.  
For GLSL, make the recompile cycle fast, and you'll be able to simply edit the code in near-realtime.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-515672290275093581?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/515672290275093581/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/01/debugging-glsl.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/515672290275093581'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/515672290275093581'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/01/debugging-glsl.html' title='Debugging GLSL'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-1024100085402608785</id><published>2010-01-27T10:06:00.003-05:00</published><updated>2010-01-27T10:23:55.874-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='GLSL'/><title type='text'>A Tile Too Far</title><content type='html'>I've been playing with shading algorithms lately.  One such algorithm is the "number puzzle".  The basic idea is to take a repeating texture that is divided into sub-tiles and randomly move the tiles around.  This is implemented in-shader by separating the UV coordinates and randomizing the bits that represent the "tile".  (This is usually all but the lowest N bits.)  
The tile choice is made by sampling a random noise map, and the UV input to that comes from the upper bits so that it is stable (i.e. we only switch tiles at the tile boundary).&lt;br /&gt;&lt;br /&gt;One nice property of the number puzzle is that if you don't have shaders, you simply get a repeating texture.  This is handy because the art assets and code don't have to be special-cased for fixed function - we end up with uglier, but valid output.&lt;br /&gt;&lt;br /&gt;It occurred to me today that the number puzzle can be atlased - that is, the random tile we pick could be constrained by the upper bits of the UV map, so that (by using a broad "space" of UV coordinates) we can pick from a set of tiles within a larger texture.  This is a win because it means we can texture atlas and thus merge a bunch of differently tiled surfaces into one batch.&lt;br /&gt;&lt;br /&gt;There is just one problem with this technique, one that might be a deal breaker as long as fixed function is necessary: when the shader is off, the atlasing gets ignored and we end up with junk.  
There really isn't a good way around this: wrapping and atlasing are, as far as I know, incompatible in the fixed function pipeline.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-1024100085402608785?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/1024100085402608785/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/01/tile-too-far.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/1024100085402608785'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/1024100085402608785'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/01/tile-too-far.html' title='A Tile Too Far'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-8983720345952455673</id><published>2010-01-14T09:03:00.004-05:00</published><updated>2010-11-11T14:34:18.623-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='c++'/><category scheme='http://www.blogger.com/atom/ns#' term='Performance'/><title type='text'>Fast Paths</title><content type='html'>When looking at code speed, you can put on two different hats:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;When &lt;i&gt;designing&lt;/i&gt; an API, you might ask: how do we prevent a slow-down in the fastest possible path?&lt;/li&gt;&lt;li&gt;When &lt;i&gt;implementing&lt;/i&gt; an API, 
you might ask: how does this affect overall performance?&lt;/li&gt;&lt;/ul&gt;They're not the same.  Consider, for example, OpenGL state shadowing.&lt;br /&gt;&lt;br /&gt;A well optimized OpenGL client program would not do this:&lt;br /&gt;&lt;blockquote&gt;&lt;code&gt;glEnable(GL_TEXTURE_2D);&lt;br /&gt;glDrawArrays(GL_TRIANGLES, 0, 51);&lt;br /&gt;glEnable(GL_TEXTURE_2D);&lt;br /&gt;glDrawArrays(GL_TRIANGLES, 108, 51);&lt;br /&gt;&lt;/code&gt;&lt;/blockquote&gt;The second enable of texturing is totally unneeded.  The clever programmer would optimize this away.  But what does the OpenGL implementation do?  We have two choices:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Check the texture enable state before doing a glEnable.  In the case where the programmer didn't optimize, this saves an expensive texture state change, and in the case where the programmer did optimize, it is an unnecessary comparison, probably of one bit.&lt;/li&gt;&lt;li&gt;Do not check - always do the enable.  In the case where the programmer didn't optimize, the program is slow; in the case where the programmer did optimize, we deliver the fastest path.&lt;/li&gt;&lt;/ol&gt;In other words, it is a question of whether to optimize overall system performance in a world where programmers are sometimes stupid or lazy, or whether to make sure that those who write the fastest code get the fastest possible code.&lt;br /&gt;&lt;br /&gt;(In a real program, detecting duplicate state change is very difficult, since code flow can be dynamic.  For example, in X-Plane we draw only what is on screen.  
Since the model that was drawn just before your model will change with camera angle, the state of OpenGL just before we draw will vary a lot.)&lt;br /&gt;&lt;br /&gt;From my perspective as a developer who tries to write really fast code, I don't care which one a library writer chooses, as long as the library clearly declares &lt;i&gt;what&lt;/i&gt; is fast and what is not.&lt;br /&gt;&lt;br /&gt;This was the motivation behind the "datarefs" APIs in the X-Plane SDK: a dataref is an opaque handle to a data source, and we have two sets of operations:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;"Finding" the dataref, where the link is made from the permanent string identifier to the opaque handle.  This operation is officially "slow" and client code is expected to take steps to avoid finding datarefs more than necessary, in performance critical locations, in loops, etc.  (Secretly finding was linear time for a while and is now log time, so it was never &lt;i&gt;that&lt;/i&gt; slow. )&lt;/li&gt;&lt;li&gt;Reading/writing the dataref, where data is transferred.  This operation is officially "fast"; Sandy and I keep a close eye on how much code happens inside the dataref read/write path and forgo heavy validation.  The motivation here is: we're not going to penalize well-written performance-critical plugins with validation on every write because other plugins are badly written.  Instead the failure case is indeterminate behavior, including but not limited to program termination.  (I'm not ruling out &lt;a href="http://catb.org/jargon/html/N/nasal-demons.html"&gt;nasal demons&lt;/a&gt; either!)&lt;/li&gt;&lt;/ul&gt;This notion of "protecting the fast path" (that is, making sure the fastest possible code is as fast as possible) serves as a good guideline in understanding both C and C++ language design; in most cases, given a choice, C/C++ protect the fast path, rather than protecting, well, you.&lt;br /&gt;&lt;br /&gt;A simple example: case statements.  
Case statements have this nasty behavior that they will "fall through" to the next statement if break is not included.  99% of the time, this is a programmer error, and it would be nice (most of the time) if the language disallowed it.  But then we would lose this fast path:&lt;br /&gt;&lt;blockquote&gt;&lt;code&gt;switch(some_thingie) {&lt;br /&gt;case MODE_A:&lt;br /&gt;       do_some_stuff(); // no break: deliberately falls through to MODE_B&lt;br /&gt;case MODE_B:&lt;br /&gt;       do_shared_behavior();&lt;br /&gt;}&lt;/code&gt;&lt;/blockquote&gt;In this case, where we want specialized behavior and then common behavior in mode A, but only the common behavior in mode B, fall-through lets us write ever so slightly more optimal code.&lt;br /&gt;&lt;br /&gt;If this seems totally silly now, in a world where optimizers regularly take our entire program, break it down into subatomic particles, and then reconstitute it as chicken nuggets, we have to remember that C was designed in the 70s on machines where the compiler could barely run due to memory constraints; if the programmer didn't write C to produce optimal code, there wasn't going to be any optimal code.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-8983720345952455673?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/8983720345952455673/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/01/fast-paths.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8983720345952455673'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/8983720345952455673'/><link rel='alternate' type='text/html' 
href='http://hacksoflife.blogspot.com/2010/01/fast-paths.html' title='Fast Paths'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-3685423689480184275</id><published>2010-01-05T14:30:00.003-05:00</published><updated>2010-01-05T14:45:53.206-05:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='c++'/><category scheme='http://www.blogger.com/atom/ns#' term='Quotes'/><category scheme='http://www.blogger.com/atom/ns#' term='Software Development'/><title type='text'>Coding For Two Audiences</title><content type='html'>I'm going to keep going with the "pithy one-liner thing", because obviously computer programming can be completely reduced to drinking coffee, cursing, and a few sentences you can write on the back of your hand.  Okay, here goes:&lt;br /&gt;&lt;blockquote&gt;All code is written for two audiences.&lt;/blockquote&gt;Ha!  You could put that in a fortune cookie.  Seriously though, that is the truth, and it is the driving motivation behind my style guidelines for headers.&lt;br /&gt;&lt;br /&gt;The first audience for your code is, of course, the compiler.  The compiler is a tool that writes your application - your code is a set of instructions to the compiler about what you want it to do.  Since compilers aren't very creative (at least we hope) you have to be very precise, and the compiler tends to be very picky.  A compiler gets all bent out of shape when you write things like:&lt;br /&gt;&lt;blockquote&gt;&lt;code&gt;viod set_flaps(int position);&lt;/code&gt;&lt;/blockquote&gt;No imagination, those compilers.  
They also aren't real good at catching things like this:&lt;br /&gt;&lt;blockquote&gt;&lt;code&gt;if (x=0) init_subsystem();&lt;/code&gt;&lt;/blockquote&gt;(Not quite fair - compilers now do catch some of the more knuckleheaded things you can do - but look at the C++-tagged posts in this blog for examples of what the compiler thinks isn't a bad idea.)&lt;br /&gt;&lt;br /&gt;So lots of books have been written about how to write code that won't confuse the compiler and you'll find engineers who insist on writing if (0 == x) and such.  That serves the first audience well.  But what of the second audience?&lt;br /&gt;&lt;br /&gt;The second audience is the humans who will have to read the code in the future in order to use or change it.  That includes future you, so for your own sake, be nice to this audience.  Code says something to people, not just to compilers.  Consider this:&lt;br /&gt;&lt;blockquote&gt;&lt;code&gt;typedef void * model_3d_ref;&lt;br /&gt;model_3d_ref load_model_from_disk(const char * absolute_file_path);&lt;br /&gt;void draw_model(model_3d_ref the_model, float where_x, float where_y, float where_z);&lt;br /&gt;void deallocate_model(model_3d_ref kill_this);&lt;/code&gt;&lt;/blockquote&gt;Without knowing what the hell we're doing, if you know C and have worked as a computer programmer a few years, you probably already have a rough idea of what I'm trying to do with those declarations.  
Humans read code, and humans infer things from the code that will be necessary to work on it.&lt;br /&gt;&lt;br /&gt;The compiler doesn't read your code like this - the following code is exactly the same to a compiler:&lt;br /&gt;&lt;blockquote&gt;&lt;code&gt;void * load_model_from_disk(const char *);&lt;br /&gt;void draw_model(void *, float, float, float);&lt;br /&gt;void deallocate_model(void *);&lt;/code&gt;&lt;/blockquote&gt;To humans, though, the above reads a lot more like gibberish.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Header Nazi&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;And that is why I am a header Nazi.  Here's how I do the math: if you write code that is useful, bug free, and reasonably well encapsulated/insulated, then people are going to spend a lot more time looking at the header to understand the interface than they will spend looking at the implementation.  (In fact, it should be unnecessary to look at the implementation at all to use the code.)&lt;br /&gt;&lt;br /&gt;For this reason, I want my headers to be clean, clean, clean.  I want them to read like a book, because that's what they are: the user's manual for this module, written for the humans who will use it.  This tic comes out in a few forms:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;I prefer physical insulation (putting code in the cpp file) to logical encapsulation (putting things in the private: part of an object) because it gets the implementation details out of sight.  
It keeps the human readers from being distracted by how the module works, and helps keep inexperienced programmers from mistaking implementation for interface.&lt;/li&gt;&lt;li&gt;If I have to inline for performance, I keep the inline out-of-class at the bottom of the header so it doesn't detract from readability.&lt;/li&gt;&lt;li&gt;Bulk comments about usage go in the header to form a document.&lt;/li&gt;&lt;li&gt;Any semantics about calling conventions go in the header so that examining source is not necessary.&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/6042417775578107106-3685423689480184275?l=hacksoflife.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://hacksoflife.blogspot.com/feeds/3685423689480184275/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://hacksoflife.blogspot.com/2010/01/coding-for-two-audiences.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3685423689480184275'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/6042417775578107106/posts/default/3685423689480184275'/><link rel='alternate' type='text/html' href='http://hacksoflife.blogspot.com/2010/01/coding-for-two-audiences.html' title='Coding For Two Audiences'/><author><name>Benjamin Supnik</name><uri>http://www.blogger.com/profile/04886313844644521178</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-6042417775578107106.post-9210747718047012197</id><published>2010-01-02T12:18:00.005-05:00</published><updated>2010-01-02T12:50:15.812-05:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='Software Development'/><category scheme='http://www.blogger.com/atom/ns#' term='Rants'/><title type='text'>When To Rewrite</title><content type='html'>If one thing drives me crazy, it is reading claims in the flight simulator community that FS X needs "a total rewrite".  Now FS X is our (now EOLed, at least temporarily) competition, but people have made the same claim about X-Plane, and it is just as stupid for FS X now as it was for X-Plane then.  The users who claim a rewrite is needed are quite possibly not software engineers and certainly don't have access to a proprietary closed source code base, which is to say, they are completely unqualified to make such a claim.  But "let's do a total rewrite" does persist as a real strategy in the computer industry - I have been on teams that have tried this, and I can say with some confidence: it is a terrible idea.  To claim that 100% of the software should be thrown out is to fail to understand how software companies make money.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.joelonsoftware.com/articles/fog0000000069.html"&gt;Joel's treatment&lt;/a&gt; of the subject is thorough and clear.  I would only add that beyond the intensely poor return on investment of a total rewrite (i.e. spending developer time to replace field-tested, proven code that users like and that has a track record of making money with new, untested code that may be buggy, all without adding new features), the actual dynamics of a rewrite are even worse in practice.  This is how I would describe the prototypical rewrite:&lt;br /&gt;&lt;br /&gt;Software product X is first developed by a small team of grade A programmers - programmers who understand what they are doing completely, can ship product, fully chase down bugs, and understand the trade-offs of architecture vs. ship date.  
These programmers maybe don't always write the cleanest code, but when they write something dirty, they know why it's dirty, what they will do about it, and at what point it will make sense from a business standpoint to fix it.  (And the fact that the "dirty" code shipped means: that time to fix the problem hasn't come yet.)&lt;br /&gt;&lt;br /&gt;Once the product starts making money, the team grows, and the product goes into a feature mode - new versions get new features added into the code.  The business model is to sell upgrades by putting features into the code on a timely basis.  This is where things start to get tricky:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;The business model rewards shipping new features.  Thus the metric that the company should be looking at is "efficiency", e.g. how many man-months to get a feature valued at some number of dollars?&lt;/li&gt;&lt;li&gt;There is an opportunity cost to not shipping features, thus the team has been increased in size with "grade B" developers.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Now management has a serious problem: if the efficiency of the team is declining, is it because the grade B developers aren't as efficient (a known and acceptable risk) or because the code is becoming harder to work with?&lt;/li&gt;&lt;/ul&gt;Every feature is different, and it's likely that the original "A" team is working on the hardest features - the ones only they can do.  So isolating and detecting that your code base is becoming fugly is going to be nearly impossible by management.  If you have management by metrics (e.g. a management team that uses proxy metrics like bug count, KLOC and other such things but doesn't actually look at &lt;i&gt;what the code says&lt;/i&gt;) they are not going to have any tools to recognize the problem.  
Combine that with the fact that every developer says every piece of code not written by himself/herself within the last 3 days is fugly, and management just doesn't know the extent of the problem.&lt;br /&gt;&lt;br /&gt;Is the code base getting worse at this point?  Almost certainly yes!&lt;br /&gt;&lt;ul&gt;&lt;li&gt;If the original design was business-optimal, it did not contain a bunch of code to make future expansion easy.  (Side note: this is the right decision and this problem of architectural drift should &lt;i&gt;not&lt;/i&gt; be solved by making the "grand design" in version 1.  No one knows what features will actually be useful in version 2, so a "grand designed" version 1 is going to have a ton of crap that will never get productized and just take longer to ship in the first place.)&lt;/li&gt;&lt;li&gt;If the business model can't track efficiency and code quality, then the A team (the only ones capable of rearchitecting the design) are under strong pressure not to do so.  In fact, they're getting the hardest problems and are probably on the critical path in every release; asking them to rearchitect will seem like an impossibility.&lt;/li&gt;&lt;li&gt;The B team doesn't understand the design, and thus every feature they're putting in is probably screwing up the program 
