Tuesday, March 17, 2015

Accumulation to Improve Small-Batch Drawing

I sometimes see "casual" OpenGL ES developers (e.g. users making 2-d games and other less performance intensive GL applications) hit a performance wall on the CPU side. It starts with the app having something like this:
class gl_helper {
  // Each call sets up GPU state and issues a draw - one triangle at a time.
  void draw_colored_triangle_2d(color_t color, int x1, int y1,
    int x2, int y2, int x3, int y3);
  void draw_textured_triangle_2d(color_t color,
    int x1, int y1, int x2, int y2,
    int x3, int y3,
    int tex_x, int tex_y, int tex_width, int tex_height);
  void draw_textured_triangle_3d(color_t color,
    int x1, int y1, int z1, int x2, int y2, int z2,
    int x3, int y3, int z3, int tex_x,
    int tex_y, int tex_width, int tex_height);
};
You get the idea.  OpenGL ES is "tamed" by making simple functions that do what we want - one primitive at a time.

The results are correct drawing - and truly awful performance.

Why This Is Slow

Why is the above code almost guaranteed to produce slow results when implemented naively? The answer is that 3-d graphics hardware has a high CPU cost to set the GPU up to draw and a very low cost per triangle once you do draw.  So creating an API where each triangle comes in differently and thus must be individually set up maximizes the overhead and minimizes throughput.

A profile of this kind of code will show a ton of time in the actual draw call (e.g. glDrawArrays) but don't be fooled.  The time is really being spent at the beginning of glDrawArrays synchronizing the GPU with the type of drawing you want.*

Cheaper By the Dozen

The Mike Acton way of fixing this is "where there's one, there's many" - this API should allow you to draw lots of triangles, assuming they are all approximately the same.  For example,
void draw_lots_of_colored_triangles(color_t color, int count, float xyz[]); 
would not be an insane API.  At least if the number of triangles gets big, the overhead gets small.
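To make the overhead concrete, here's a toy cost model; the constants are invented purely for illustration (real numbers vary wildly by driver and hardware), but the shape of the math is the point:

```cpp
#include <cassert>

// Toy cost model (invented numbers, illustration only): every draw call pays
// a fixed CPU setup cost, plus a tiny per-triangle cost once drawing starts.
const double kSetupCostUs   = 50.0;  // hypothetical per-draw-call overhead
const double kPerTriangleUs = 0.01;  // hypothetical per-triangle cost

double draw_time_us(int draw_calls, int triangles) {
    return draw_calls * kSetupCostUs + triangles * kPerTriangleUs;
}
```

With these made-up numbers, 10,000 triangles in one call cost about 150 µs, while 10,000 one-triangle calls cost about 500,000 µs; the fixed setup cost utterly dominates.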

One thing is clear: if your application can generate batched geometry, it absolutely should be sending it to OpenGL in bulk!  You never want to run a for-loop over your big pile of triangles and send them one at a time; if you have a wrapper around OpenGL, make sure you can send the data in without chopping it up first!

When You Can't Consolidate

Unfortunately there are times when you can't actually draw a ton of triangles all at once. It's cute of me to go "oh, performance is easy - just go rewrite all of your drawing code", but this is time consuming and in some cases the app structure itself might make this hard. If you can't design for bulk performance, there is a second option: accumulation.

The idea of accumulation is this: instead of actually drawing all of those individual triangles, you stash them in memory.  You do so in a format that makes it reasonably quick to both:

  1. Save the triangles (so stashing them doesn't itself waste time), and
  2. Send them all to OpenGL at once.
Here's where the performance win comes from: the accumulator can see that the last 200 triangles were all color triangles with no texture, so it can send them to the GPU with one state setup (for non-textured triangles) and then a single 200-triangle draw call.  This is about 200x more efficient than the naive code.
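Here's a minimal sketch of that accumulator (names are hypothetical, and the actual GL submission is stubbed out as a counter so the batching logic stands alone): triangles pile up in a buffer as long as the drawing state stays the same, and a state change or end-of-frame forces one real draw call for the whole run.

```cpp
#include <cassert>
#include <vector>

// Minimal accumulator sketch. Instead of drawing each triangle immediately,
// we stash its vertices; when the client switches to a different kind of
// drawing (or the frame ends), we flush the whole run as ONE draw call.
enum draw_state { STATE_COLORED, STATE_TEXTURED };

struct accumulator {
    draw_state         current = STATE_COLORED;
    std::vector<float> pending;         // accumulated vertex data
    int                draw_calls = 0;  // stats: real submissions to GL

    void add_triangle(draw_state s, const float xy[6]) {
        if (s != current && !pending.empty())
            flush();                    // state change forces a submission
        current = s;
        pending.insert(pending.end(), xy, xy + 6);
    }
    void flush() {
        if (pending.empty()) return;
        // Real code would set up GL state for 'current' and then call
        // glDrawArrays once for the entire pending run.
        ++draw_calls;
        pending.clear();
    }
};
```

Feed this 200 same-state triangles and it issues exactly one draw call; alternate states on every triangle and it degrades back to one call per triangle, which is exactly the pathology the stats should catch.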

The accumulator also gives you a place to collect statistics about your application's usage of OpenGL.  If your app is alternating colored and textured triangles, you're going to have to change shaders (even in the accumulator) and it will still be slow.  But you can record statistics in debug mode about the size of the draws to detect this kind of "inefficient ordering."

Similarly, the accumulator can eliminate some calls to the driver to setup state because it knows what it was last doing.  The accumulator does all of its drawing in one shot; if you draw two textured triangles with different textures, the accumulator must stop to change textures (not so good), but it can go "hey, another textured triangle, same pixel shader" and avoid changing pixel shaders (a big win).
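That "same pixel shader, skip the change" logic is just a lazy state cache. A sketch (hypothetical wrapper, with the real driver calls replaced by a counter):

```cpp
#include <cassert>

// Lazy state-change elision: remember what's already bound, and only hit the
// driver when the requested state actually differs.
struct state_cache {
    int bound_texture = -1;
    int bound_shader  = -1;
    int driver_calls  = 0;   // how many times we actually hit the driver

    void set_texture(int tex) {
        if (tex == bound_texture) return;   // same texture: free
        bound_texture = tex;
        ++driver_calls;                     // real code: glBindTexture(...)
    }
    void set_shader(int prog) {
        if (prog == bound_shader) return;   // same shader: free
        bound_shader = prog;
        ++driver_calls;                     // real code: glUseProgram(...)
    }
};
```

Two textured triangles with different textures cost a texture bind each, but only one shader change between them - the second shader "change" is filtered out entirely.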

Dealing With Inefficient Ordering

So now you have an accumulator, it submits the biggest possible batches of the same kinds of triangles, and it makes the minimum state change calls when the drawing type changes.  And it's still slow. When you look at your usage stats, you find the average draw call size is still only two triangles because the client code is alternating between drawing modes all of the time.

(Maybe your level's building block consists of a textured square background with an additively blended square on top, and this means two triangles of background, state change, two triangles of additive overlay, state change again.)

I am assuming that you have already combined your images into a few large textures (texture atlasing) and that you don't have a million tiny textures floating around.  If you haven't atlased your textures, go do it now; I'll wait.


Okay welcome back. When your drawing batch size is still too small even after accumulation, you have two tools to get your batch size back up.

Draw Reordering

The first trick you can try (and you should try this one first) is to give your accumulator the freedom to reorder drawing to achieve better performance.

In our example above, every square in the level had two draws, one on top of the other, and they weren't in the same OpenGL mode.  What we can do is define each draw to be in a different layer, and let the accumulator draw all of layer 0 before any of layer 1.

Once we do that, we find that all of layer 0 is in one OpenGL state (big draw) and all of layer 1 is in the other.  We've relaxed our ordering by giving the accumulator an idea of the real draw ordering we need, rather than the implicit one that comes from the order our code runs.
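A sketch of the layer mechanism (hypothetical API; the queued draws are just strings here so the reordering is easy to see): each submission is tagged with a layer, and the flush emits every layer in order, turning an alternating mess into two solid batches.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Layered accumulation sketch: each draw is tagged with a layer index; at
// flush time we emit all of layer 0, then all of layer 1, and so on, so each
// layer becomes one big same-state batch instead of alternating tiny ones.
struct layered_accumulator {
    std::vector<std::vector<std::string>> layers;  // layer -> queued draws

    void add(int layer, const std::string& draw) {
        if ((int)layers.size() <= layer) layers.resize(layer + 1);
        layers[layer].push_back(draw);
    }
    std::vector<std::string> flush_in_layer_order() {
        std::vector<std::string> out;
        for (auto& l : layers)
            out.insert(out.end(), l.begin(), l.end());
        layers.clear();
        return out;
    }
};
```

Client code still submits background, overlay, background, overlay - but what reaches the GPU is background, background, overlay, overlay.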

We actually had just this problem in X-Plane 10 Mobile's user interface; virtually every element was a textured draw of a background element (which uses a simple texturing shader) followed by a draw of text (which uses a special font shader that applies coloring from a two-channel texture).

The result was two shader changes per UI element, and the performance was awful.

We simply modified our accumulator to draw all text after all UI elements. There's a simple "barrier" that can be placed to force stored-up text to be output before proceeding (to get major layering of the UI right), but most windows can draw all of their UI elements before any text, cutting the number of shader changes down to two total - a big win!

Merging OpenGL State

If you absolutely have to have the draw order you have (maybe there's alpha blending going on) the other lever you can pull is to find ways to make disparate OpenGL calls use more similar drawing state. (This is what texture atlasing does.)  A few tricks:

  • Use a very small solid white texture for non-textured geometry - you can now use your texturing shader at all times.
  • You don't need to get rid of color application in a shader - simply set the color to white opaque.
  • If you use pre-multiplied alpha, you can draw both additive and non-additive alpha from the same state by varying how you prepare your art assets. Opaque assets can be run with the blender on.
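The pre-multiplied alpha trick works because one blend equation - out = src + dst × (1 − src_alpha), i.e. glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA) - covers every mode, depending only on how the art's alpha was authored. A sketch of the arithmetic (plain C++ so the cases are checkable; the struct is hypothetical):

```cpp
#include <cassert>

// Pre-multiplied alpha blending: out = src + dst * (1 - src_alpha).
// With color already multiplied by alpha in the art asset:
//   alpha = 1 everywhere -> effectively opaque;
//   alpha = 0 everywhere -> pure additive blending;
//   anything in between  -> ordinary "over" compositing.
struct rgba { double r, g, b, a; };

rgba blend_premult(const rgba& src, const rgba& dst) {
    double k = 1.0 - src.a;
    return { src.r + dst.r * k, src.g + dst.g * k,
             src.b + dst.b * k, src.a + dst.a * k };
}
```

So one blend state (blender always on) serves additive, translucent, and opaque geometry, and the choice moves from GL state into the asset pipeline.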
In most of these cases, performance is potentially being lost, so you need to be sure that what you gain from bigger batches (given your required draw order) outweighs the cost of not doing the most efficient per-draw setup.  The small white texture should be pretty cheap; GPUs usually have very good texture memory caches.  Blending tricks can be very expensive on mobile GPUs, and old mobile GPUs are very sensitive to the length of the pixel shader, so you only want to leave color application on if it lives in the vertex shader.

The point of the above paragraph is: measure carefully first, then merge state second; merging state can be a win or a loss, and it's very dependent on the particular model you're drawing.


* Most drivers defer the work of changing the GPU's mode of drawing until you actually draw. This way the driver can synchronize the net result of all the changes, instead of making a separate hardware change each time you call an API command.  Since the gl calls you make don't fit the hardware very well, waiting until the driver can see all of the changes is a big win.

Saturday, March 14, 2015

glNext is Neither OpenGL nor Next, Discuss

The title is a stretch at best, but as I have said before, good punditry has to take precedence over correctness. Khronos posted a second version of the glNext/Vulkan + SPIR-V talk with good audio and slides. I'll see you in an hour and a half.


That answered all of our questions, right!  Ha ha, I kid. Seriously though, at least a little bit is now known:
  • Vulkan is not an incremental extension to OpenGL - there's no API compatibility. This is a replacement.
  • Vulkan sits a lot lower in the graphics stack than OpenGL did; this is an explicit low level API that exposes a lot of the hard things drivers did that you didn't know existed.
The driver guys in the talk seem pretty upbeat, and they should: they get to do less work in the driver than they used to! And this is a good thing; the surface area of the OpenGL API (particularly when you combine ARB_compatibility with all of the latest extensions) is Kafkaesque. If someone showed you the full API and said "go code that" you'd surely offer to cut off a finger as a less painful alternative.

My biases as a developer are in favor of not throwing out things that work, not assuming that things need a from scratch rewrite just because they annoy you, and not getting excited just because it's shiny and new.  So I am surprised with myself that at this point, I'd much rather have a Vulkan-like API than all of the latest OpenGL extensions, even though it's more work for me. (Remember that work the driver guys aren't going to do?  It goes into the engine layer.)

What's Good/Why Do We Need This?

While there's a lot of good things for game engines in Vulkan, there are a few that got my attention because they are not possible with further extension/upgrade to OpenGL:

Threading: A new API is needed because OpenGL is thread-unfriendly, and it's unfriendly at the core of how the API is written; you can't fix this by adding more stuff. Some things OpenGL does:
  • OpenGL sets up a 1:1 correspondence between queues, command buffers, and threads.  If you want something else, you're screwed, because you get one thing ("the context") and it has damned strict threading rules.
  • OpenGL does the thread synchronization for you, even if you don't want that.  There are locks inside the driver, and you can't get rid of them.*
With Vulkan, command buffers and queues are separate, resource management is explicit, and no synchronization is done on your behalf.

This is definitely a win for game engines. For example, with X-Plane we will load a scenery tile "in the background". We know during loading that every object involved in the scenery tile is "thread local" to us, because they have not been shared. There is no common data between the rendering thread and the loader.

Therefore both can run completely lock free.  There is a one-time synchronization when the fully finished tile is inserted into the active world; this insert happens only after the load is complete (via a message queue) and is done between frames by the rendering thread.  Again, no locks.  This code can run lock free at pretty much all points.

There's no way for the GL driver to know that. Every time I go to shovel data into a VBO in OpenGL, the driver has to go "I wonder if anyone is using this?  Is this call going to blow up the world?"  Under Vulkan, the answer is "I'm the app, trust me."  That's going to go a lot faster.  We're getting rid of "safety checks" in the driver that are not needed.
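The hand-off pattern looks roughly like this (a sketch, not X-Plane's actual code; names are hypothetical): the loader thread builds its tile with no locks at all, because nothing else can see it yet, and the only synchronization is the queue push/pop at the end.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// The loader builds a tile that is "thread local" by construction - no other
// thread has a pointer to it. The one-time synchronization is the hand-off
// through this queue, which the rendering thread drains between frames.
struct tile { std::string name; /* meshes, VBO handles, etc. */ };

struct tile_queue {                     // the single synchronization point
    std::mutex                        m;
    std::queue<std::unique_ptr<tile>> q;

    void push(std::unique_ptr<tile> t) {
        std::lock_guard<std::mutex> lock(m);
        q.push(std::move(t));
    }
    std::unique_ptr<tile> try_pop() {   // called between frames, never blocks long
        std::lock_guard<std::mutex> lock(m);
        if (q.empty()) return nullptr;
        std::unique_ptr<tile> t = std::move(q.front());
        q.pop();
        return t;
    }
};
```

Ownership transfers with the unique_ptr, so there is never a moment when both threads can touch the same tile - which is exactly the guarantee the GL driver can't infer on its own.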

Explicit Performance: one of the hardest things about developing realtime graphics with OpenGL is knowing where the "fast path" is.  The GL API lets you do about a gajillion different things, and only a few call paths are going to go fast.  Sometimes I see threads like this on OpenGL mailing lists:
Newb: hey AMD, when I set the refrigerator state to GL_FROZEN_CUSTARD and then issue a glDrawGizmo(GL_ICECREAM, 10); I see a massive performance slow-down. Your driver sucks!
I'm sitting in front of my computer going "Oh noooooes!!!  You can't use ice cream with frozen custard - that's a crazy thing to do."  Maybe I even write a blog post about it.

But how the hell does anyone ever know?  OpenGL becomes a game of "write-once performance tune everywhere" (or "write once, harass the driver guys to run your app through vtune and tell you you're an idiot everywhere") - sometimes it's not possible to tell why something is slow (NVidia, I'm looking at you and your stripped driver :-) and sometimes you just don't have time to look at every driver case (cough cough, Intel, cough).

OpenGL doesn't just have a huge API, it has a combinatorially huge API - you can combine just about anything with anything else; documenting the fast path (even if all driver providers could agree) is mathematically impossible.

Vulkan fixes this by making performance explicit.  These functions are fast, these are slow.  Don't call the slow calls when you want to go fast.  It gives app developers a huge insight into what is expensive for the driver/hardware and what is not.

Shim It: I may do a 180 on this when I have to code it, but I think it may be easier to move legacy OpenGL apps to Vulkan specifically because it is not the OpenGL API.

When we had to port X-Plane 9 for iPhone from GLES 1.1 to GLES 2.0, I wrote a shim that emulated the stuff we needed from GLES 1.1.  Some of this is now core to our engine (e.g. the transform stack) and some still exists because it is only used in crufty non-critical path code and it's not worth it to rip it out (e.g. glBegin).

The shimming exercise was not that hard, but it was made more complicated by the fact that half of the GL API is actually implemented in both versions of the spec.  I ended up doing some evil macro trickery: glDrawElements gets #defined over to our internal call, which updates the lazily changed transform stack and then calls the real glDrawElements.  Doing this level of shim with the full desktop GL API would have been quite scary I think.
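The macro trick reduces to something like this (a heavily simplified sketch - the real glDrawElements takes arguments, and the stand-in functions here are hypothetical):

```cpp
#include <cassert>

// Macro-shim sketch: the emulation layer keeps a dirty flag on its transform
// stack; "glDrawElements" is #defined over to a wrapper that flushes pending
// transform state before calling through to the real entry point.
static int  g_real_draws      = 0;
static bool g_transform_dirty = false;

void real_glDrawElements() { ++g_real_draws; }  // stands in for the driver call

void shim_glDrawElements() {
    if (g_transform_dirty) {
        // Real code: upload the current matrix to our shader's uniform here.
        g_transform_dirty = false;
    }
    real_glDrawElements();
}

// Client code keeps saying "glDrawElements" and never knows the difference.
#define glDrawElements shim_glDrawElements
```

The charm of the scheme is that legacy call sites compile unchanged; the pain is keeping track of which entry points exist in both API versions and so need the rename dance.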

Because Vulkan isn't gl at all, one option is to simply implement OpenGL using...Vulkan. I will be curious to see whether a portable open source gl layer emerges; if it does, it would be a useful way for very large legacy code bases to move to Vulkan.  There'd be two wins:

  1. Reliability.  That's a lot less code that comes from the driver; whether your gl layer works right or is buggy as a bed in New York, it's going to be the same bugs everywhere - if you've ever tried to ship a complicated cross-platform OpenGL app, having the same bugs everywhere is like Christmas (or so I'm told).
  2. Incremental move to Vulkan.  Once you are running on a GL shim, poke a hole through it when you need to get to the metal for only performance-critical stuff.  (This is what we did with GLES 1.1/2.0: the entire UI ran in GLES 1.1 emulation and the rendering engine went in and bound its own custom shaders.)

Vulkan is Not For Everyone

When "OpenGL is Borked" went around the blogs last year one thing that struck me was how many different constituencies were grumpy about OpenGL, often wanting fixes that could not co-exist. Vulkan resolves this tension: it's low level, it's explicit, it's not backward compatible, and therefore it's only for developers who want to do more work to get more perf and don't need to run on everything or can shim their old code.

I think this is a good thing: at least Vulkan can do the limited task it tries to do well. But it's clearly not for beginners, not for teaching an introduction to 3-d graphics, and if you were grumpy about how much work it was to use GLES 2.0 for your mobile game, Vulkan's not going to make you very happy.  And if you're sitting on 100,000,000 lines of CAD code that's all written in OpenGL, Vulkan doesn't do you as much good as that one extension you really really really need.

For developers like me (full time, professional, small company, proprietary engine) there's definitely going to be a cost in moving to Vulkan in development time. Whenever the driver guys talk about resource management they often say something like:
The app has to do explicit resource management, which it's probably already doing on console.
For the big game engines this is totally true, so being able to re-use their resource management code is a win. For smaller games, OpenGL is their resource management code.  It's not necessarily very good resource management (in that the GL driver is basically guessing about what you want, and sometimes guessing wrong), but if you have a three-person development team, having Graham Sellers write your resource management code for you for free is sort of an epic win.

Resource management is the one area where what we know now is way too fuzzy. You can look at Apple's Metal API (fully public, shipping, code samples) and see what a world with non-mutable objects, command queues and command buffers looks like. But resource management in Metal is super simple because it only runs on a shared memory device: a buffer object is a ptr to memory, full stop.  (Would that it were that easy on all GPUs.)

It's too soon to tell what the "boiler plate" will look like for handling resource management in Vulkan.  There's a huge difference in the quality of resource management between different driver stacks; writing a resource manager that does as well as AMD or NVidia's is going to be a real challenge for small development teams.



* My understanding is that if you create only one GL context (and thus you are not using threads) the driver will actually run in a lock-free mode to avoid overhead.  The fact that the driver bothers to detect that and special case it gives you some idea how crazy a GL driver is.  If that doesn't, read this.

Thursday, February 26, 2015

The Ambiguously Overloaded Operator

It's been a while since we've had a good CGAL error. Went to upgrade to CGAL 4.5.2 and got this:
/Volumes/RAID/code/xptools/src/XESCore/MapPolygon.cpp:22:0 /Volumes/RAID/code/xptools/src/XESCore/MapPolygon.cpp:22: error: ambiguous overload for 'operator+=' in '* extent += ((const CGAL::Point_2 > >, true> >*)circ.CGAL::_HalfedgeDS_facet_const_circ > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, CGAL::Arr_bounded_planar_topology_traits_2 > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, CGAL::Arr_extended_dcel > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, GIS_vertex_data, GIS_halfedge_data, GIS_face_data, CGAL::Arr_vertex_base > >, true> > >, CGAL::Arr_halfedge_base > >, true> >, CGAL::_Unique_list > >, CGAL::Gps_face_base> > >::Halfedge, CGAL::I_Filtered_const_iterator > >, true> > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > >, true> >, CGAL::_Unique_list > >, GIS_halfedge_data>, CGAL::Arr_extended_face >, std::allocator > >, true> > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > >, true> >, CGAL::_Unique_list > >, GIS_halfedge_data>, CGAL::Arr_extended_face > > >, CGAL::Arrangement_on_surface_2 > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, CGAL::Arr_bounded_planar_topology_traits_2 > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, CGAL::Arr_extended_dcel > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, GIS_vertex_data, GIS_halfedge_data, GIS_face_data, CGAL::Arr_vertex_base > >, true> > >, CGAL::Arr_halfedge_base > >, true> >, CGAL::_Unique_list > >, CGAL::Gps_face_base> > >::_Is_valid_halfedge, 
CGAL::internal::In_place_list_iterator > >, true> > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > >, true> >, CGAL::_Unique_list > >, GIS_halfedge_data>, CGAL::Arr_extended_face >, std::allocator > >, true> > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > >, true> >, CGAL::_Unique_list > >, GIS_halfedge_data>, CGAL::Arr_extended_face > > >, CGAL::Arrangement_on_surface_2 > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, CGAL::Arr_bounded_planar_topology_traits_2 > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, CGAL::Arr_extended_dcel > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, GIS_vertex_data, GIS_halfedge_data, GIS_face_data, CGAL::Arr_vertex_base > >, true> > >, CGAL::Arr_halfedge_base > >, true> >, CGAL::_Unique_list > >, CGAL::Gps_face_base> > >::Halfedge, int, std::bidirectional_iterator_tag>, CGAL::Bidirectional_circulator_tag>::.CGAL::I_Filtered_const_iterator::operator-> [with CIterator_ = CGAL::internal::In_place_list_const_iterator > >, true> > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > >, true> >, CGAL::_Unique_list > >, GIS_halfedge_data>, CGAL::Arr_extended_face >, std::allocator > >, true> > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > >, true> >, CGAL::_Unique_list > >, GIS_halfedge_data>, CGAL::Arr_extended_face > > >, Filter_ = CGAL::Arrangement_on_surface_2 > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, CGAL::Arr_bounded_planar_topology_traits_2 > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, CGAL::Arr_extended_dcel > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, 
CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, GIS_vertex_data, GIS_halfedge_data, GIS_face_data, CGAL::Arr_vertex_base > >, true> > >, CGAL::Arr_halfedge_base > >, true> >, CGAL::_Unique_list > >, CGAL::Gps_face_base> > >::_Is_valid_halfedge, MIterator_ = CGAL::internal::In_place_list_iterator > >, true> > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > >, true> >, CGAL::_Unique_list > >, GIS_halfedge_data>, CGAL::Arr_extended_face >, std::allocator > >, true> > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > >, true> >, CGAL::_Unique_list > >, GIS_halfedge_data>, CGAL::Arr_extended_face > > >, Value_ = CGAL::Arrangement_on_surface_2 > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, CGAL::Arr_bounded_planar_topology_traits_2 > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, CGAL::Arr_extended_dcel > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, GIS_vertex_data, GIS_halfedge_data, GIS_face_data, CGAL::Arr_vertex_base > >, true> > >, CGAL::Arr_halfedge_base > >, true> >, CGAL::_Unique_list > >, CGAL::Gps_face_base> > >::Halfedge, Diff_ = int, Category_ = std::bidirectional_iterator_tag]()->CGAL::Arrangement_on_surface_2::Halfedge::source [with GeomTraits_ = CGAL::Gps_segment_traits_2 > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, TopTraits_ = CGAL::Arr_bounded_planar_topology_traits_2 > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, CGAL::Arr_extended_dcel > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, GIS_vertex_data, 
GIS_halfedge_data, GIS_face_data, CGAL::Arr_vertex_base > >, true> > >, CGAL::Arr_halfedge_base > >, true> >, CGAL::_Unique_list > >, CGAL::Gps_face_base> >]().CGAL::I_Filtered_const_iterator::operator-> [with CIterator_ = CGAL::internal::In_place_list_const_iterator > >, true> > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > >, true> >, CGAL::_Unique_list > >, GIS_halfedge_data>, CGAL::Arr_extended_face >, std::allocator > >, true> > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > >, true> >, CGAL::_Unique_list > >, GIS_halfedge_data>, CGAL::Arr_extended_face > > >, Filter_ = CGAL::Arrangement_on_surface_2 > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, CGAL::Arr_bounded_planar_topology_traits_2 > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, CGAL::Arr_extended_dcel > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, GIS_vertex_data, GIS_halfedge_data, GIS_face_data, CGAL::Arr_vertex_base > >, true> > >, CGAL::Arr_halfedge_base > >, true> >, CGAL::_Unique_list > >, CGAL::Gps_face_base> > >::_Is_concrete_vertex, MIterator_ = CGAL::internal::In_place_list_iterator > >, true> > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > >, true> >, CGAL::_Unique_list > >, GIS_halfedge_data>, CGAL::Arr_extended_face >, std::allocator > >, true> > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > >, true> >, CGAL::_Unique_list > >, GIS_halfedge_data>, CGAL::Arr_extended_face > > >, Value_ = CGAL::Arrangement_on_surface_2 > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, CGAL::Arr_bounded_planar_topology_traits_2 > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, 
true> >, int> >, CGAL::Arr_extended_dcel > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, GIS_vertex_data, GIS_halfedge_data, GIS_face_data, CGAL::Arr_vertex_base > >, true> > >, CGAL::Arr_halfedge_base > >, true> >, CGAL::_Unique_list > >, CGAL::Gps_face_base> > >::Vertex, Diff_ = int, Category_ = std::bidirectional_iterator_tag]()->CGAL::Arrangement_on_surface_2 > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, CGAL::Arr_bounded_planar_topology_traits_2 > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, CGAL::Arr_extended_dcel > >, true>, std::vector > >, true> >, std::allocator > >, true> > > >, CGAL::Arr_consolidated_curve_data_traits_2 > >, true> >, int> >, GIS_vertex_data, GIS_halfedge_data, GIS_face_data, CGAL::Arr_vertex_base > >, true> > >, CGAL::Arr_halfedge_base > >, true> >, CGAL::_Unique_list > >, CGAL::Gps_face_base> > >::Vertex::.CGAL::Arr_vertex > >, true> > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > >, true> >, CGAL::_Unique_list > >, GIS_halfedge_data>, CGAL::Arr_extended_face >::.CGAL::Arr_extended_vertex > >, true> > >, GIS_vertex_data>::.CGAL::Arr_vertex_base::point [with Point_ = CGAL::Point_2 > >, true> >]())->CGAL::Point_2::bbox [with R_ = CGAL::Filtered_kernel > >, true>]()'
I am pretty sure the actual problem is trivial and obvious, but it'll be easier to fix if I ignore the error message.

Saturday, January 31, 2015

The Four Horsemen of the Performance Apocalypse

In my previous post, I made the claim that really high performance software is designed to be high performance, and that a two-step process of writing code without consideration of performance and then trying to "fix it later" (guided by performance tools to show you the worst performance problems) would not work.

That's a strong claim to make, particularly given how many very smart people are saying things like "profile first, optimize later" (which is better, I guess, than "write code, don't profile, randomly try to make it faster by poking it with a stick").  So I want to make the case that there are specific factors which hurt performance, and that they actually tend to get settled a lot earlier in the process of writing software than we might think.

I am also trying to come up with enough pithy quotes and catchily-named coding ideas that, when I become unemployable due to my ever-spreading gray-hair, I will be able to write some kind of book of coding wisdom and transition into a career of tech punditry.  I have already figured out how to make 3-d transitions in Keynote!

When Does Performance Matter

We all know that 100% of our software does not need to be optimized.  So let me be clear about the scope where these ideas apply: sometimes (perhaps often) software's performance matters to its functionality.
  • When you open an application, how long do you have to wait before the application is ready for you to use?
  • When you interact with an editor, how long does it take for the editor to respond to each edit? Does the UI feel slow and unresponsive due to this speed?  If you hit undo, do you have to wait while the undo takes place?  Do you lose track of what you were doing because it took so long?
  • When you scroll around a document, is the scrolling smooth and fluid, or can you see the screen refreshing only 2 times a second?  Does it feel like "work" to move the document because tracking your scroll motion is so unresponsive?
These examples are close to home - they are all from software I have worked on, and in some cases, they are examples of bad performance that I am directly responsible for.  There's code I have worked on where the performance is very good, and there is code that I have worked on where the performance is lousy; in the set of posts that follow, I'll try to describe what works and what doesn't from experiences both successful and embarrassing.

If your dialog box opens in 30 ms, I don't care.  It might be really lame that a dialog box takes 30 ms on a 3 ghz machine to appear, but opening a dialog box isn't a hard problem.  But there are plenty of things we do with computers where performance absolutely does matter.  It may be necessary to make the user experience pleasant, or it may be necessary to allow a user to reach their full potential creativity while using your program, or it may matter for allowing your application to run on a wider range of hardware (e.g. mobile phones).

So as we proceed, think about parts of your code base where performance scalability does matter.  If it's a source control program, how fast is it to change branches?  If it's a flight simulator, how many 3-d buildings can you show in a city?  If it's an audio program, how many audio tracks and edits can a user make and still play back the composition?  If it's a lego modeling program, how many lego bricks can the user add to a creation?

Meet the Four Horsemen

Now that we know that we're working on some kind of "core" problem and not the UI code for preferences, let's meet the four horsemen of the performance apocalypse.  It looks like Google returns zero search results for that query, so on with the punditry!  I really only have three horsemen and a qualifying statement, but "pithy" has to trump "accuracy" if we're going to get anywhere.

The four horsemen are:
  1. Doing Unnecessary Work.
  2. Ignoring Constant-Time Inefficiencies.
  3. Unnecessary Generalization.
  4. Compound Interest.
That last one is more finance than computer science, and it's not really a form of bad code, but it might be the most important factor of all.

1. Doing Unnecessary Work

The first way to have slow code instead of high performance code is to do unnecessary work.  All of that big-O notation ("your sort algorithm is O(N^2), dumbass") falls into this bucket.  If your code is going to be high performance, it needs to do what needs to be done but not more than what needs to be done.

Stupid examples:
  • When your application has to re-draw a table, does it fetch data for the entire table instead of just the part of the table the user can see?  Those extra fetches are unnecessary!
  • When a value is computed and used over and over, do you save the computation or just re-compute from scratch each time?  All but the first computation might be unnecessary.
  • When you draw with OpenGL, do you change OpenGL state between draws even though the state you need is already the current state?  Those "redundant" state changes are unnecessary, and the massive thrash inside the driver that is caused by this is definitely unnecessary.
  • When your editor program edits a large number of items in a document, do you update the UI for every item, or just once at the end?  If the former, then all but the last UI update might be unnecessary.
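The second and fourth bullets are really the same pattern: cache a result, mark it dirty when its inputs change, and recompute only on demand instead of on every use.  Here's a minimal sketch; the shape class and its members are made up for illustration, not taken from any real code base:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical example: the bounding radius is queried every frame, but its
// inputs (the vertices) change rarely.  A dirty flag lets us compute it once
// and reuse the result until the inputs actually change.
struct shape {
    float   x[3], y[3];
    float   cached_radius = 0.0f;
    bool    radius_dirty  = true;

    void set_vertex(int i, float vx, float vy) {
        x[i] = vx; y[i] = vy;
        radius_dirty = true;            // invalidate - but don't recompute yet
    }
    float bounding_radius() {
        if (radius_dirty) {             // recompute at most once per change
            cached_radius = 0.0f;
            for (int i = 0; i < 3; ++i)
                cached_radius = std::max(cached_radius,
                                         std::sqrt(x[i]*x[i] + y[i]*y[i]));
            radius_dirty = false;
        }
        return cached_radius;
    }
};
```

The key point is that set_vertex only invalidates; the expensive work happens at most once per change, no matter how many times bounding_radius is called in between.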
In the domain of unnecessary work, the algorithm is king - the wrong algorithm can cause the amount of work done to increase by a huge factor (10x!  100x!).  Pick data structures that enable the best algorithms.

When it comes to unnecessary work, think bigger than algorithmic efficiency:
  • In your game, do you need to shade the terrain using the GPU?  Or could you have shaded the terrain ahead of time and saved the shading to disk? If so, all of that shading is unnecessary! "Ahead of time" is often the best time to do work.
  • In your racing game, do you need houses everywhere on the level, or only really close to the track where the driver can see them?  If you put houses everywhere, the work spent to load them from disk, cull them, and possibly render them is all unnecessary.
Avoiding unnecessary work can go all the way back to the process, in defining what the application actually does and how it solves the problem.  X-Plane 10 mobile lets the user pick specific times of day on Earth.  Because we don't need flexibility to leave the atmosphere, go to other planets, or set arbitrary times of the day or year, a full simulation of the Earth's atmosphere under all conditions is unnecessary work that we can avoid on mobile devices.

Is avoiding unnecessary work something that we can save until the end of the development process? Clearly the answer is no.  Avoiding unnecessary work is about determining how your program solves a user's problem and about selecting the algorithms and design constraints under which you will do this. This is stuff you do early in the design process!

2. Ignoring Constant Time Inefficiencies

Avoiding unnecessary work is critical to good performance, but there will be some work your program must do, and when your program computes, it needs to not totally suck in the process. Discussions of algorithms usually focus on the relationship between input data size and compute time (Big-O notation), but the constant factor - how fast each item is processed - really matters too.

Paying attention to constant time is all the rage now - if you've heard Mike Acton ranting about inefficient node-based game engines or Chandler Carruth telling you to stop using std::map, you've heard discussion of constant-time factors.  Examples of constant-time fail:
  • Dynamically allocating data with malloc when it could be stored statically (on-stack, globally, or in-class).
  • Using std::map when a sorted std::vector will work.  The map has terrible locality and has to call malloc a gajillion times.
  • Using virtual functions when static free functions or inline functions would do.
  • Using dynamic_cast when a static_cast would work.
  • Using memory structures that have poor cache locality.
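To make the second bullet concrete, here's a sketch of the sorted-std::vector replacement for std::map: build the table once, sort once, then binary-search contiguous memory.  The table contents and function names are invented for the example:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// A sorted std::vector of pairs can replace std::map for tables that are
// built once and queried many times: one contiguous allocation instead of a
// node allocation per key, and far better cache locality.
using entry = std::pair<int, const char*>;

const char* find_name(const std::vector<entry>& table, int key) {
    // binary search over contiguous memory - same O(log N) as std::map,
    // but with a much smaller constant factor
    auto it = std::lower_bound(table.begin(), table.end(), key,
        [](const entry& e, int k) { return e.first < k; });
    return (it != table.end() && it->first == key) ? it->second : nullptr;
}

std::vector<entry> make_table() {
    std::vector<entry> t = { {30, "cat"}, {10, "dog"}, {20, "bird"} };
    std::sort(t.begin(), t.end());       // sort once, after building
    return t;
}
```

This only works when the build phase and the query phase are separate; if you need to insert and search in arbitrary interleaved order, the trade-off changes.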
I can think of two possible reasons why constant-time factors have become so important in the game development community (and why, when the game-nerds talk about them, general desktop app programmers sometimes think the game-nerds are nuts):
  1. The gap between CPU and memory performance is widening, so the "little" things aren't so little anymore.  If a cache miss to main memory takes 300 cycles while the L1 cache takes 3 cycles, then the "constant" in the constant time factor of cache locality is 100x!!!!!
  2. Some of the constant time factors that are not so bad on desktop (where we have super-high performance chips that do aggressive things to hide latency) are still quite bad on consoles.  I experienced this myself when working on mobile: the sloppy code I would barf out on desktop would run "okay" on a Mac Pro but would run terribly on an iPhone 4S.
Here's one reason why I think constant time factors can be really important: we tend to use the same set of tools for most of the code we write.  The result is that slow constant time techniques tend to get into 100% of our code; ripping out a slow constant time technique and fixing it isn't a matter of a "hot spot" - it can be a complete app rewrite.

The string "map<" occurs 588 times in the WED code base; "set<" occurs 822 times!  Approximately 100% of this code is running a lot slower than it could be.  This performance problem is the cumulative result of using slow-constant-time coding techniques for years.

So if we're going to avoid slow constant time techniques, we need to figure out what is inefficient and find an efficient replacement.  And we need to do this early on, so that we can use the right idiom in our code, not the wrong one.  This doesn't sound like something that can be done after coding is complete, in a late-stage optimization pass.

3. The Weight Of Unnecessary Generalization

We're programmers.  We generalize things; it's just what we do.  But when we generalize code unnecessarily, performance goes out the window.

Unnecessary generalization is using an algorithm to take the intersection of two arbitrary polygons with holes when all we really want to do is clip a triangle against an axis-aligned bounding box.

Unnecessary generalization is making a "drawing" function that draws a universal triangle (a triangle with color, texture and position) and then making every single part of the application use that canonical call.

Unnecessary generalization is when your selected solution to a problem doesn't fit the problem like a glove.

Unnecessary generalization can come in the form of a "simplifying abstraction".  WorldEditor redraws the entire screen when any part of the screen needs redrawing.  This is a simplifying abstraction - none of the code needs to worry about keeping track of dirty regions or invalidating the right parts of the screen.  We solve the general problem of filling the whole screen instead of the specific (but more complex) problem of filling in specific areas.  You can see the performance problem here: the general form of the problem does unnecessary work because the optimization we could have taken with the specific version of the problem is lost.
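As a rough sketch of what the specific (non-general) version looks like, here is a hypothetical dirty-rectangle tracker: accumulate one damage rectangle per frame and repaint only that region.  None of this is WorldEditor's actual code; the names are made up:

```cpp
#include <algorithm>
#include <cassert>

// Minimal dirty-region tracking: the alternative to "always redraw the
// whole screen" is to union up the damaged area and repaint just that.
struct rect {
    int x1, y1, x2, y2;
    bool empty() const { return x1 >= x2 || y1 >= y2; }
};

struct dirty_tracker {
    rect dirty { 0, 0, 0, 0 };

    void invalidate(const rect& r) {     // union new damage into the dirty rect
        if (dirty.empty()) { dirty = r; return; }
        dirty.x1 = std::min(dirty.x1, r.x1);
        dirty.y1 = std::min(dirty.y1, r.y1);
        dirty.x2 = std::max(dirty.x2, r.x2);
        dirty.y2 = std::max(dirty.y2, r.y2);
    }
    rect take_damage() {                 // the region to repaint this frame
        rect r = dirty;
        dirty = rect { 0, 0, 0, 0 };
        return r;
    }
};
```

The cost of the specific solution is exactly the bookkeeping the abstraction removed: every edit site must remember to call invalidate.  That's the trade the simplifying abstraction makes - less code to get wrong, more pixels to repaint.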

There are lots of ways to create overly general code:
  1. Using an overly general solution to a specific problem.  The programmer thinks he or she is making the code "future proof", but really the programmer is simply making Mike Acton even more grumpy.
  2. Solving a simpler problem that is more computationally expensive.
  3. Applying simplifying (but performance-expensive) abstractions between parts of the program that make the direct expression of specific solutions impossible.
This last kind of generalization weight is particularly evil because it's hard to fix after the fact.  Once you have an abstraction in place, lots of code ends up being written in terms of that abstraction. Getting rid of the abstraction can mean a huge rewrite.

Can we fix overly general code after the fact?  Probably not without a rewrite.  When the general code is in the form of an interface (the drawing module's interface is "I will draw one textured triangle") then huge amounts of code can require a serious rewrite to fix the problem.

4. Compound Interest

This is the last and most important problem: performance problems compound.  If you have three generalizations and each makes your code run 25% slower, your code is now almost twice as slow as without them.  Generalizations weigh down on top of constant time penalties, and that slow code then has to plow through too much data.  It's easy to dismiss all of these problems (doing more work than necessary, slow constant-time factors, and unnecessary generalization) because any one of these things might not be that bad.
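The arithmetic is worth spelling out, because independent penalties multiply rather than add: three 25% slowdowns give 1.25^3 ≈ 1.95x, not 1.75x.

```cpp
#include <cassert>
#include <cmath>

// Independent performance penalties compound multiplicatively:
// total slowdown = (1 + p1) * (1 + p2) * ... , not p1 + p2 + ...
double compounded_slowdown(double per_layer_penalty, int layers) {
    return std::pow(1.0 + per_layer_penalty, layers);
}
```

With ten such layers the factor is 1.25^10 ≈ 9.3x - which is how "nothing in the profile looks bad" and "the app is an order of magnitude too slow" can both be true.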

Compounding is why a 1% fee is a big deal in finance and why a 1% performance gain or loss is a big deal for driver writers.

You can say "oh, well, if anything's really bad I'll pick it off with a profiler and fix it".  But because of compounding, your code might be slow due to a large number of small factors, and it might be a lot of coding to fix them all. (This is the dreaded "uniformly slow code" case where every function takes 1% of CPU and you have to make one hundred separate performance optimizations to improve performance at all.)

Good Performance Comes From Design

This brings me to the point I originally wanted to make about Knuth and premature optimization before Joshua beat me to it: if a program has good performance, it is because somebody designed it to be that way.

Good performance comes from keeping multiple performance problems from compounding.  Look at our sources of poor performance:
  • Doing unnecessary work.  This has to get fixed at algorithm design time and even product definition time. If you have the wrong algorithm, that's a bad design choice.
  • Using slow coding idioms.  This permeates all of your code, so rewriting with faster technique later is going to be expensive.
  • Including generalizations that are unaffordable and unnecessary.  Once you depend on the abstractions and generalizations, removing them is painful.
These are all up-front considerations!

In Knuth's original paper he suggests:
that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off. 
But this isn't enough.  It's not enough to measure how fast code is, and then change it.  To make a program meet its performance guidelines you need to design for performance.
  • Understand the size of your data sets and the scope of the computation needed; select an algorithm that can solve the problem within resources.
  • Pick coding idioms that will implement that algorithm "on budget".
  • Add only abstractions that add value and don't penalize your performance target.
Optimization before you have done these things is premature!  But then, so is writing production code.

Sunday, January 04, 2015

High Performance Code Is Designed, Not Optimized

Sigh...this is what I get for pretending to do work when I could have been writing blog posts and changing my blogger theme a few dozen times.  I was totally going to write a big rant about how it's not okay to use Knuth as an excuse to pull code straight out of your ass and then hope to fix performance at the last minute by repeatedly finding and beating on the hot spots with a profiler.

But Joshua Barczak totally beat me to it.  So, um, go read what he wrote.

I have a bunch of topics I want to discuss in detail, but to start, here's an attempt at a pithy quote (for my future career as a tech pundit):
High performance software is always high performance by design.
I cannot emphasize this enough: high performance applications meet their performance goals by designing for performance from day one.  For all the parts of X-Plane where I am proud of our performance, I can point to a subsystem that was designed to be fast, and where performance is part of its fundamental architecture.

The Knuth quote about premature optimization isn't just often misquoted, it's incorrect.  To have high performance software, it is not enough to focus tightly on actual hot spots in the code and tune them. The architecture of the code must be designed for high performance.  A better version of the Knuth quote might be:
Premature design without analyzing the performance characteristics of the problem is the root of all evil.
Clearly I'm going to have to slim that quote down and stylize it if I'm ever going to make it as a pundit.*

Over the next few posts I will try to make the case that the factors that really affect performance (both for the better and the worse) tend to get settled very early in the design process, and therefore the notion of achieving high performance by pushing performance work until late in the process is lunacy.

* I do already love the sound of my own voice blathering on, and I'm also pretty excited to spout off on topics that I know nothing about, so I figure I am at least partly qualified for the job.

Friday, December 26, 2014

OpenGL ES Performance: The iPhone 4 Performance Gap

Now that we've shipped X-Plane 10 Mobile, I can blog the occasional OpenGL ES performance note without revealing a stealth project.

X-Plane 10 Mobile requires iOS 8.  One reason why Chris set the OS requirement so high was to intentionally exclude the iPhone 4 from our device set.

For most of the project, we worked with the iPhone 4 as our minimum device, and for the entire project, it suffered performance problems that we didn't see in any of the newer devices.  The iPad 2 and iPhone 4S (the next hardware up) both performed significantly better.

I don't know what caused the gap, but it wasn't a "this phone is 20% slower" or "this iPad has 4x the shader units" kind of gap.  It was more like "this code runs fine on newer devices and we can just see the iPhone 4 trying to beat itself to death with its old-style dock connector so it doesn't have to run another frame of our main rendering loop".

I do not know what was going on under the hood, but I can share a few observations besides "If the 4 is having performance problems, it may be specific to the iPhone 4."

  • The iPhone 4 was the only device that would get bottlenecked on vertex count.  This was a real problem for us because we had models whose vertex count we couldn't cut without our artists spending a ton of time.  We had already LODed out everything expendable on the 4 and we were still getting jammed in the 3-d cockpit.
  • The iPhone 4 is very sensitive to the number of varyings and how they are packed!!!!  I found that packing fog and emissive light level into a 2-component varying significantly improved performance compared to having them as individual scalars.  (Of course, cutting the number of varyings made the biggest improvement.)
  • The iPhone 4 seemed to be spending significant driver time recycling the pool of "orphaned" buffers - that is, VBOs that had been completely discarded and respecified on a per-frame basis.
I can't say what was going on inside the driver, but I can say that all of these things were changes in the kinds of performance problems we were having, not just a matter of degree.

Once we cut the iPhone 4 and the 4S became the "runt of the litter", handling the low end became a lot easier.  The iPhone 4S is incredibly capable for an older-generation smartphone, and while it is the first to run out of CPU cycles or fill rate, the losses were proportional to spec, not "fall on your face and die."

I'm hoping to post a little bit more about performance tuning OpenGL ES in future posts, but from this point forward, any advice I have will apply to the 4S and above.  Having cut the iPhone 4 from our product, I no longer have time to figure out what makes it go fast.*

* One of the difficulties of OpenGL and OpenGL ES is that while the spec specifies what the APIs do, they don't specify how fast they do it.  Performance isn't guaranteed, deterministic, or often even published except in IHV presentations.  One of the big pluses of Metal (and possibly GLNext) is deterministic performance - an API that tells you what will be expensive and what won't be.

Tuesday, October 29, 2013

Good Advice From GIT

I found this in a GIT man page.  (Emphasis added by me.)
union
Run 3-way file level merge for text files, but take lines from both versions, instead of leaving conflict markers. This tends to leave the added lines in the resulting file in random order and the user should verify the result. Do not use this if you do not understand the implications.
I thought that was pretty good advice for all programming (but it does apply double to GIT).