Thursday, February 26, 2015

The Ambiguously Overloaded Operator

It's been a while since we've had a good CGAL error. Went to upgrade to CGAL 4.5.2 and got this:
/Volumes/RAID/code/xptools/src/XESCore/MapPolygon.cpp:22: error: ambiguous overload for 'operator+=' in '* extent += ...'

[The rest of the message is several thousand characters of nested CGAL template instantiations - Arrangement_on_surface_2, Gps_segment_traits_2, Arr_consolidated_curve_data_traits_2, I_Filtered_const_iterator, and friends - bottoming out in a call to CGAL::Point_2::bbox with R_ = CGAL::Filtered_kernel.]
I am pretty sure the actual problem is trivial and obvious, but it'll be easier to fix if I ignore the error message.

Saturday, January 31, 2015

The Four Horsemen of the Performance Apocalypse

In my previous post, I made the claim that really high performance software is designed to be high performance, and that a two-step process of writing code without consideration of performance and then trying to "fix it later" (guided by performance tools to show you the worst performance problems) would not work.

That's a strong claim to make, particularly given how many very smart people are saying things like "profile first, optimize later" (which is better, I guess, than "write code, don't profile, randomly try to make it faster by poking it with a stick").  So I want to make the case that there are specific factors which hurt performance, and that they actually tend to get settled a lot earlier in the process of writing software than we might think.

I am also trying to come up with enough pithy quotes and catchily-named coding ideas that, when I become unemployable due to my ever-spreading gray hair, I will be able to write some kind of book of coding wisdom and transition into a career of tech punditry.  I have already figured out how to make 3-d transitions in Keynote!

When Does Performance Matter?

We all know that 100% of our software does not need to be optimized.  So let me be clear about the scope where these ideas apply: sometimes (perhaps often) software's performance matters to its functionality.
  • When you open an application, how long do you have to wait before the application is ready for you to use?
  • When you interact with an editor, how long does it take for the editor to respond to each edit?  Does the UI feel slow and unresponsive as a result?  If you hit undo, do you have to wait while the undo takes place?  Do you lose track of what you were doing because it took so long?
  • When you scroll around a document, is the scrolling smooth and fluid, or can you see the screen refreshing only 2 times a second?  Does it feel like "work" to move the document because tracking your scroll motion is so unresponsive?
These examples are close to home - they are all from software I have worked on, and in some cases, they are examples of bad performance that I am directly responsible for.  There's code I have worked on where the performance is very good, and there is code that I have worked on where the performance is lousy; in the set of posts that follow, I'll try to describe what works and what doesn't from experiences both successful and embarrassing.

If your dialog box opens in 30 ms, I don't care.  It might be really lame that a dialog box takes 30 ms to appear on a 3 GHz machine, but opening a dialog box isn't a hard problem.  But there are plenty of things we do with computers where performance absolutely does matter.  It may be necessary to make the user experience pleasant, or it may be necessary to allow a user to reach their full creative potential while using your program, or it may matter for allowing your application to run on a wider range of hardware (e.g. mobile phones).

So as we proceed, think about parts of your code base where performance scalability does matter.  If it's a source control program, how fast is it to change branches?  If it's a flight simulator, how many 3-d buildings can you show in a city?  If it's an audio program, how many audio tracks and edits can a user make and still play back the composition?  If it's a Lego modeling program, how many Lego bricks can the user add to a creation?

Meet the Four Horsemen

Now that we know that we're working on some kind of "core" problem and not the UI code for preferences, let's meet the four horsemen of the performance apocalypse.  It looks like Google returns zero search results for that query, so on with the punditry!  I really only have three horsemen and a qualifying statement, but "pithy" has to trump "accuracy" if we're going to get anywhere.

The four horsemen are:
  1. Doing Unnecessary Work.
  2. Ignoring Constant Time Inefficiencies.
  3. Unnecessary Generalization.
  4. Compound Interest.
That last one is more finance than computer science, and it's not really a form of bad code, but it might be the most important factor of all.

1. Doing Unnecessary Work

The first way to have slow code instead of high performance code is to do unnecessary work.  All of that big-O notation ("your sort algorithm is O(N^2), dumbass") falls into this bucket.  If your code is going to be high performance, it needs to do what needs to be done but not more than what needs to be done.

Stupid examples:
  • When your application has to re-draw a table, does it fetch data for the entire table instead of just the part of the table the user can see?  Those extra fetches are unnecessary!
  • When a value is computed and used over and over, do you save the computation or just re-compute from scratch each time?  All but the first computation might be unnecessary.
  • When you draw with OpenGL, do you change OpenGL state between draws even though the state you need is already the current state?  Those "redundant" state changes are unnecessary, and the massive thrash inside the driver that they cause is definitely unnecessary.
  • When your editor program edits a large number of items in a document, do you update the UI for every item, or just once at the end?  If the former, then all but the last UI update might be unnecessary.
In the domain of unnecessary work, the algorithm is king - the wrong algorithm can cause the amount of work done to increase by a huge factor (10x!  100x!).  Pick data structures that enable the best algorithms.
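To make the OpenGL bullet above concrete, here's a minimal sketch of filtering redundant state changes by shadowing the current binding (hypothetical names - this is not the actual X-Plane renderer):

    #include <OpenGL/gl.h>   // <GL/gl.h> on non-Apple platforms

    static GLuint s_current_tex = 0;

    void set_texture(GLuint tex)
    {
        if (tex == s_current_tex)
            return;                         // already bound - skip the driver call
        glBindTexture(GL_TEXTURE_2D, tex);
        s_current_tex = tex;
    }

If every draw goes through a wrapper like this, the redundant binds (and the driver thrash behind them) simply never happen.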

When it comes to unnecessary work, think bigger than algorithmic efficiency:
  • In your game, do you need to shade the terrain using the GPU?  Or could you have shaded the terrain ahead of time and saved the shading to disk? If so, all of that shading is unnecessary! "Ahead of time" is often the best time to do work.
  • In your racing game, do you need houses everywhere on the level, or only really close to the track where the driver can see them?  If you put houses everywhere, the work spent to load them from disk, cull them, and possibly render them is all unnecessary.
Avoiding unnecessary work can go all the way back in the process, to defining what the application actually does and how it solves the problem.  X-Plane 10 Mobile lets the user pick specific times of day on Earth.  Because we don't need the flexibility to leave the atmosphere, go to other planets, or set arbitrary times of the day or year, a full simulation of the Earth's atmosphere under all conditions is unnecessary work that we can avoid on mobile devices.

Is avoiding unnecessary work something that we can save until the end of the development process? Clearly the answer is no.  Avoiding unnecessary work is about determining how your program solves a user's problem and about selecting the algorithms and design constraints under which you will do this. This is stuff you do early in the design process!

2. Ignoring Constant Time Inefficiencies

Avoiding unnecessary work is critical to good performance, but there will be some work your program must do, and when your program computes, it needs to not totally suck in the process.  Discussions of algorithms usually focus on the relationship between input data size and compute time (Big-O notation), but the constant factor - how fast each item is processed - really matters too.

Paying attention to constant time is all the rage now - if you've heard Mike Acton ranting about inefficient node-based game engines or Chandler Carruth telling you to stop using std::map, you've heard discussion of constant-time factors.  Examples of constant-time fail:
  • Dynamically allocating data with malloc when it could be stored statically (on-stack, globally, or in-class).
  • Using std::map when a sorted std::vector will work (see the sketch after this list).  The map has terrible locality and has to call malloc a gajillion times.
  • Using virtual functions when static free functions or inline functions would do.
  • Using dynamic_cast when static_cast would work.
  • Using memory structures that have poor cache locality.
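Here's a minimal sketch of the sorted-vector idiom from the second bullet (illustrative code, not from WED): one contiguous allocation and a binary search via std::lower_bound, where std::map would call malloc once per node:

    #include <algorithm>
    #include <string>
    #include <utility>
    #include <vector>

    typedef std::pair<int, std::string> KV;     // key, value

    static bool key_less(const KV& a, const KV& b) { return a.first < b.first; }

    // Sort the vector once after building it, then do read-mostly lookups
    // with good cache locality.
    const std::string* find_value(const std::vector<KV>& sorted_kv, int key)
    {
        std::vector<KV>::const_iterator it =
            std::lower_bound(sorted_kv.begin(), sorted_kv.end(),
                             KV(key, std::string()), key_less);
        return (it != sorted_kv.end() && it->first == key) ? &it->second : NULL;
    }

The trade-off is that insertion in the middle is O(N), which is why this idiom fits build-once, query-many tables best.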
I can think of two possible reasons why constant-time factors have become so important in the game development community (and why, when the game-nerds talk about them, general desktop app programmers sometimes think the game-nerds are nuts):
  1. The gap between CPU and memory performance is widening, so the "little" things aren't so little anymore.  If a cache miss to main memory takes 300 cycles while the L1 cache takes 3 cycles, then the "constant" in the constant time factor of cache locality is 100x!!!!!
  2. Some of the constant time factors that are not so bad on desktop (where we have super-high performance chips that do aggressive things to hide latency) are still quite bad on consoles.  I experienced this myself when working on mobile: the sloppy code I would barf out on desktop would run "okay" on a Mac Pro but would run terribly on an iPhone 4S.
Here's one reason why I think constant time factors can be really important: we tend to use the same set of tools for most of the code we write.  The result is that slow constant time techniques tend to get into 100% of our code; ripping out a slow constant time technique and fixing it isn't a matter of a "hot spot" - it can be a complete app rewrite.

The string "map<" occurs 588 times in the WED code base; "set<" occurs 822 times!  Approximately 100% of this code is running a lot slower than it could be.  This performance problem is the cumulative result of using slow-constant-time coding techniques for years.

So if we're going to avoid slow constant time techniques, we need to figure out what is inefficient and find an efficient replacement.  And we need to do this early on, so that we can use the right idiom in our code, not the wrong one.  This doesn't sound like something that can be done after coding is complete, in a late-stage optimization pass.

3. The Weight Of Unnecessary Generalization

We're programmers.  We generalize things; it's just what we do.  But when we generalize code unnecessarily, performance goes out the window.

Unnecessary generalization is using an algorithm to take the intersection of two arbitrary polygons with holes when all we really want to do is clip a triangle against an axis-aligned bounding box.

Unnecessary generalization is making a "drawing" function that draws a universal triangle (a triangle with color, texture and position) and then making every single part of the application use that canonical call.
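As a sketch of that trap (hypothetical signatures - not any real engine's API), compare the fully general interface with the call the application actually needs:

    struct Triangle { float xyz[9]; };      // three corners

    // The universal call: any texture, any color, one triangle at a time.
    // Every call re-validates state and submits a tiny batch.
    void draw_one_triangle(unsigned tex, const float rgba[4], const Triangle& t);

    // The specific call: many same-state triangles at once, so state is
    // bound once and the batch is big enough to matter.
    void draw_triangles(unsigned tex, const float rgba[4],
                        const Triangle* tris, int count);

The universal version isn't wrong; the problem is that once every part of the application funnels through it, nobody can batch again without a rewrite.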

Unnecessary generalization is when your selected solution to a problem doesn't fit the problem like a glove.

Unnecessary generalization can come in the form of a "simplifying abstraction".  WorldEditor redraws the entire screen when any part of the screen needs redrawing.  This is a simplifying abstraction - none of the code needs to worry about keeping track of dirty regions or invalidating the right parts of the screen.  We solve the general problem of filling the whole screen instead of the specific (but more complex) problem of filling in specific areas.  You can see the performance problem here: the general form of the problem does unnecessary work because the optimization we could have taken with the specific version of the problem is lost.

There are lots of ways to create overly general code:
  1. Using an overly general solution to a specific problem.  The programmer thinks he or she is making the code "future proof", but really the programmer is simply making Mike Acton even more grumpy.
  2. Solving a simpler problem that is more computationally expensive.
  3. Applying simplifying (but performance-expensive) abstractions between parts of the program that make the direct expression of specific solutions impossible.
This last kind of generalization weight is particularly evil because it's hard to fix after the fact.  Once you have an abstraction in place, lots of code ends up being written in terms of that abstraction. Getting rid of the abstraction can mean a huge rewrite.

Can we fix overly general code after the fact?  Probably not without a rewrite.  When the general code is in the form of an interface (the drawing module's interface is "I will draw one textured triangle") then huge amounts of code can require a serious rewrite to fix the problem.

4. Compound Interest

This is the last and most important problem: performance problems compound.  If you have three generalizations and each makes your code run 25% slower, your code is now almost twice as slow as without them (1.25^3 ≈ 1.95).  Generalizations weigh down on top of constant time penalties, and that slow code then has to plow through too much data.  It's easy to dismiss all of these problems (doing more work than necessary, slow constant-time factors, and unnecessary generalization) because any one of these things might not be that bad.

Compounding is why a 1% fee is a big deal in finance and why a 1% performance gain or loss is a big deal for driver writers.

You can say "oh, well, if anything's really bad I'll pick it off with a profiler and fix it".  But because of compounding, your code might be slow due to a large number of small factors, and it might be a lot of coding to fix them all. (This is the dreaded "uniformly slow code" case where every function takes 1% of CPU and you have to make one hundred separate performance optimizations to improve performance at all.)

Good Performance Comes From Design

This brings me to the point I originally wanted to make about Knuth and premature optimization before Joshua beat me to it: if a program has good performance, it is because somebody designed it to be that way.

Good performance comes from keeping multiple performance problems from compounding.  Look at our sources of poor performance:
  • Doing unnecessary work.  This has to get fixed at algorithm design time and even product definition time. If you have the wrong algorithm, that's a bad design choice.
  • Using slow coding idioms.  This permeates all of your code, so rewriting with a faster technique later is going to be expensive.
  • Including generalizations that are unaffordable and unnecessary.  Once you depend on the abstractions and generalizations, removing them is painful.
These are all up-front considerations!

In Knuth's original paper he suggests:
that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off. 
But this isn't enough.  It's not enough to measure how fast code is, and then change it.  To make a program meet its performance guidelines, you need to design for performance.
  • Understand the size of your data sets and the scope of the computation needed; select an algorithm that can solve the problem within resources.
  • Pick coding idioms that will implement that algorithm "on budget".
  • Add only abstractions that add value and don't penalize your performance target.
Optimization before you have done these things is premature!  But then, so is writing production code.

Sunday, January 04, 2015

High Performance Code Is Designed, Not Optimized

Sigh...this is what I get for pretending to do work when I could have been writing blog posts and changing my blogger theme a few dozen times.  I was totally going to write a big rant about how it's not okay to use Knuth as an excuse to pull code straight out of your ass and then hope to fix performance at the last minute by repeatedly finding and beating on the hot spots with a profiler.

But Joshua Barczak totally beat me to it.  So, um, go read what he wrote.

I have a bunch of rants (er, topics) I want to discuss in detail, but to start, here's an attempt at a pithy quote (for my future career as a tech pundit):
High performance software is always high performance by design.
I cannot emphasize this enough: high performance applications meet their performance goals by designing for performance from day one.  For all the parts of X-Plane where I am proud of our performance, I can point to a subsystem that was designed to be fast, and where performance is part of its fundamental architecture.

The Knuth quote about premature optimization isn't just often misquoted, it's incorrect.  To have high performance software, it is not enough to focus tightly on actual hot spots in the code and tune them. The architecture of the code must be designed for high performance.  A better version of the Knuth quote might be:
Premature design without analyzing the performance characteristics of the problem is the root of all evil.
Clearly I'm going to have to slim that quote down and stylize it if I'm ever going to make it as a pundit.*

Over the next few posts I will try to make the case that the factors that really affect performance (both for the better and for the worse) tend to get settled very early in the design process, and therefore the notion of having high performance code by pushing performance work to late in the process is lunacy.

* I do already love the sound of my own voice blathering on, and I'm also pretty excited to spout off on topics that I know nothing about, so I figure I am at least partly qualified for the job.

Friday, December 26, 2014

OpenGL ES Performance: The iPhone 4 Performance Gap

Now that we've shipped X-Plane 10 Mobile, I can blog the occasional OpenGL ES performance note without revealing a stealth project.

X-Plane 10 Mobile requires iOS 8.  One reason why Chris set the OS requirement so high was to intentionally exclude the iPhone 4 from our device set.

For most of the project, we worked with the iPhone 4 as our minimum device, and for the entire project, it suffered performance problems that we didn't see in any of the newer devices.  The iPad 2 and iPhone 4S (the next hardware up) both performed significantly better.

I don't know what caused the gap, but it wasn't a "this phone is 20% slower" or "this iPad has 4x the shader units" kind of gap.  It was more like "this code runs fine on newer devices and we can just see the iPhone 4 trying to beat itself to death with its old-style dock connector so it doesn't have to run another frame of our main rendering loop".

I do not know what was going on under the hood, but I can share a few observations besides "If the 4 is having performance problems, it may be specific to the iPhone 4."

  • The iPhone 4 was the only device that would get bottlenecked on vertex count.  This was a real problem for us because we had models that we couldn't cut vertex count on without our artists spending a ton of time.  We had already LODed out everything expendable on the 4 and we were still getting jammed in the 3-d cockpit.
  • The iPhone 4 is very sensitive to the number of varyings and how they are packed!!!!  I found that packing fog and emissive light level into a 2-component varying significantly improved performance compared to having them as individual scalars (see the sketch after this list).  (Of course, cutting the number of varyings made the biggest improvement.)
  • The iPhone 4 seemed to be spending significant driver time recycling the pool of "orphaned" buffers - that is, VBOs that had been completely discarded and respecified on a per-frame basis.
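As an illustration of the varying-packing point above (illustrative GLSL ES, not the shipping X-Plane shaders), fog and emissive light level ride in one vec2 varying instead of two scalar varyings:

    static const char* kFragShader =
        "uniform sampler2D u_tex;\n"
        "uniform vec4      u_fog_color;\n"
        "varying vec2      v_uv;\n"
        "varying vec2      v_fog_emis;   // x = fog mix, y = emissive level\n"
        "void main() {\n"
        "    vec4 c = texture2D(u_tex, v_uv);\n"
        "    c.rgb += c.rgb * v_fog_emis.y;\n"
        "    gl_FragColor = mix(c, u_fog_color, v_fog_emis.x);\n"
        "}\n";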
I can't say what was going on inside the driver, but I can say that all of these things were changes in the kinds of performance problems we were having, not just a matter of degree.

Once we cut the iPhone 4 and the 4S became the "runt of the litter," handling the low end became a lot easier.  The iPhone 4S is incredibly capable for an older-generation smartphone, and while it is the first to run out of CPU cycles or fill rate, the losses were proportional to spec, not "fall on your face and die."

I'm hoping to post a little bit more about performance tuning OpenGL ES in future posts, but from this point forward, any advice I have will apply to the 4S and above.  Having cut the iPhone 4 from our product, I no longer have time to figure out what makes it go fast.*

* One of the difficulties of OpenGL and OpenGL ES is that while the spec specifies what the APIs do, they don't specify how fast they do it.  Performance isn't guaranteed, deterministic, or often even published except in IHV presentations.  One of the big pluses of Metal (and possibly GLNext) is deterministic performance - an API that tells you what will be expensive and what won't be.

Tuesday, October 29, 2013

Good Advice From GIT

I found this in a GIT man page.  (Emphasis added by me.)
union
Run 3-way file level merge for text files, but take lines from both versions, instead of leaving conflict markers. This tends to leave the added lines in the resulting file in random order and the user should verify the result. Do not use this if you do not understand the implications.
I thought that was pretty good advice for all programming (but it does apply double to GIT).

Wednesday, October 02, 2013

3-d Mouse Testing

I just finished fixing a long-running bug in BrickSmith's culling code; this post is a review of ways to do mouse testing of a 3-d model.

Modelview Space

The original BrickSmith code tested the mouse in the modelview-space of the entire model:
  • The mouse click is treated as a ray in normalized device coordinates (that is, it goes from Z=-1 to Z=1 at an X and Y that are proportional to the click on the viewport).
  • The mouse click ray is transformed by the inverse of the model-view + projection matrices to get a test ray in model-view space.
  • Each test triangle is forward transformed by its relative transformation before testing.
  • The ray and triangle are tested via something like Möller-Trumbore.
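For reference, here's a minimal ray-triangle test in the spirit of Möller-Trumbore (hypothetical Vec3 helpers - BrickSmith's actual code is structured differently).  Note that the hit distance t falls out for free:

    #include <cmath>

    struct Vec3 { double x, y, z; };

    static Vec3 sub(Vec3 a, Vec3 b)
    {
        Vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z };
        return r;
    }
    static Vec3 cross(Vec3 a, Vec3 b)
    {
        Vec3 r = { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x };
        return r;
    }
    static double dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

    // True if the ray org + t*dir hits triangle (v0, v1, v2); t is the hit
    // distance, which the caller can use to sort hits front to back.
    bool ray_hits_triangle(Vec3 org, Vec3 dir, Vec3 v0, Vec3 v1, Vec3 v2, double& t)
    {
        const double kEps = 1e-9;
        Vec3 e1 = sub(v1, v0), e2 = sub(v2, v0);
        Vec3 p = cross(dir, e2);
        double det = dot(e1, p);
        if (std::fabs(det) < kEps) return false;    // ray parallel to triangle
        double inv_det = 1.0 / det;
        Vec3 s = sub(org, v0);
        double u = dot(s, p) * inv_det;
        if (u < 0.0 || u > 1.0) return false;
        Vec3 q = cross(s, e1);
        double v = dot(dir, q) * inv_det;
        if (v < 0.0 || u + v > 1.0) return false;
        t = dot(e2, q) * inv_det;
        return t >= 0.0;
    }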
There are a few things to note about this original approach:
  • We need the depth of intersection to sort our hits, and we get that from the algorithm for free, which is nice.
  • The algorithm forward transforms every triangle (sad) but it also does not have to calculate the inverse matrices of any of the sub-transforms of our model. Since LDraw sub-transforms can be pretty much anything in a 4x3 matrix, not having to run a real 'inverse' is nice.
  • Because we aren't doing plane-based intersections or using matrix inverses, there is no need to cache any data to make the intersection tests faster.  Since BrickSmith is an editor, not having to precompute 'helper' data for our model  (that would need to be recalculated every time the user edits) is a win.
  • There is no culling or early out; a hit test looks at pretty much the entire model, which is not fast.
BrickSmith never shipped with a model-view-based approach to the marquee selection (which selects every triangle intersecting an axis-aligned bounding box (AABB) in screen space).

Had we stayed with a modelview-space approach, I believe the marquee selection would have been implemented by taking the intersection of all triangles on the positive side of six planes: four planes going through the camera position and the selection box, as well as the near and far clip planes.  In other words, the marquee selection forms a tighter-frustum in which we want to find intersections.
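The per-point piece of such a test is dirt cheap; here's an illustrative sketch (for triangles you also need conservative overlap logic, not just a point test):

    struct Plane { double nx, ny, nz, d; };    // n.p + d >= 0 means "inside"

    bool point_in_frustum(const Plane pl[6], double x, double y, double z)
    {
        for (int i = 0; i < 6; ++i)
            if (pl[i].nx*x + pl[i].ny*y + pl[i].nz*z + pl[i].d < 0.0)
                return false;                  // outside this plane - rejected
        return true;
    }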

Had we had such a frustum test, it would have had two other uses:
  • Accelerating rendering (by discarding any rendering that was fully outside the 'big' frustum that is the viewport) and
  • If the frustum test is faster than our ray-triangle hit test (above), we can accelerate single-point testing by first culling the entire scene to a frustum that 'barely' surrounds our hit ray (e.g. via a marquee selection of just a few pixels).
Since there was no marquee selection or fast frustum test in the original code, there was no utility to accelerate point testing or drawing.  Mousing over very large models could be very slow (because the real-time coordinates-under-mouse display has to do a hit test to determine the depth under the cursor) and rendering very large models was slow even if they were mostly off screen.

Screen Space

When I looked at getting a frustum test going (to implement marquee selection), I came up with an idea that I thought at the time would be very clever: work entirely in screen space.  Working in screen space creates a more fundamentally limited API (because all queries must be specified in screen space points and AABBs, whereas the old code could test any ray in any direction) but BrickSmith was only using screen-space tests, and they can potentially be done quite a bit quicker.  Here's how the new code worked:
  • Point queries and AABB queries are always provided in normalized device coordinates (with returned depths also in NDC from -1 to 1).
  • To test triangles, we forward transform them all the way through their complete matrix (model view and projection) and then do a perspective divide.
  • The resulting primitive is 2-d in its extent and can be tested very cheaply.  In particular, the AABB test is dirt cheap in 2-d.
We now have our frustum test and can thus rapidly reject things that are either off screen or far from our click point.  To make things even faster, the code tests the AABB of containers of parts first (by transforming the AABB from its local model space to screen space, building a new AABB around it, and testing that).  This hierarchical culling of bounding boxes is what provides the real speed.  Given an MPD model with most of the sub-models off screen, we pay almost nothing for them.
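A sketch of that container test (hypothetical types - the real code differs): forward-transform the eight corners of a part's local AABB, perspective-divide, and wrap a 2-d NDC box around the result:

    #include <algorithm>

    struct Vec4  { double x, y, z, w; };
    struct AABB2 { double xmin, ymin, xmax, ymax; };

    // Column-major 4x4 matrix times (x, y, z, 1).
    static Vec4 xform(const double m[16], double x, double y, double z)
    {
        Vec4 o;
        o.x = m[0]*x + m[4]*y + m[8]*z  + m[12];
        o.y = m[1]*x + m[5]*y + m[9]*z  + m[13];
        o.z = m[2]*x + m[6]*y + m[10]*z + m[14];
        o.w = m[3]*x + m[7]*y + m[11]*z + m[15];
        return o;
    }

    AABB2 ndc_bounds(const double mvp[16], const double lo[3], const double hi[3])
    {
        AABB2 out = { 1e30, 1e30, -1e30, -1e30 };
        for (int i = 0; i < 8; ++i)
        {
            Vec4 p = xform(mvp, (i & 1) ? hi[0] : lo[0],
                                (i & 2) ? hi[1] : lo[1],
                                (i & 4) ? hi[2] : lo[2]);
            // The unguarded perspective divide: exactly the step that goes
            // wrong in front of the near clip plane (see the next section).
            double nx = p.x / p.w, ny = p.y / p.w;
            out.xmin = std::min(out.xmin, nx);  out.xmax = std::max(out.xmax, nx);
            out.ymin = std::min(out.ymin, ny);  out.ymax = std::max(out.ymax, ny);
        }
        return out;
    }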

One more win: because the frustum surrounding a single mouse-click test is so tiny, almost everything in a model is rejected early when testing for the mouse, so mouse-over-the-model is fast even when a huge model is entirely on screen.

Clip-Space

At this point I was feeling pretty clever and smug (because for a really big model, the hierarchical culling could give you a 10x or 50x speedup) and happy (that my models were usable again).  But of course if you have been working with 3-d graphics for a while, you're probably face-palming yourself at the big leak in the abstraction of working in clip space.

Here's the thing: BrickSmith normally keeps your entire model behind the near clip plane.  It's like your model is sitting on a table and you are hovering over it; you never actually jam your face into the model.  So the fact that I was working in screen space was not a problem because the entire model always had a sane representation in clip space.

Until it didn't.  I implemented a "walk-through" camera that would let the user get inside their model and look around.  For users making modular buildings, this leads to some pretty cool screen-shots.  But it also led to low framerate and (even weirder) whole parts of the model just disappearing randomly!  I looked at the math over and over, and there just wasn't anything broken.

Of course what was broken was not a math mistake, it was a lack of clipping.  The problem is that in screen space, geometry in front of the near clip plane is eliminated by the GPU; therefore we don't have to worry about the strange things that happen in front of the near clip plane.  But in my hit-testing code, those vertices were being processed as is.

After a standard glFrustum transformation, vertices in clip space (homogeneous coordinates) have their -Z in their W coordinate which will become a divisor to the final device-space coordinate.  X and Y have been multiplied by the near clip plane.  So:
  • At the near clip plane, geometry does not change size.  In other words, you have a simple 2-d coordinate system.
  • Behind the near clip plane, things get smaller as they get farther away - this is not astonishing.
  • In front of the near clip plane, things get really big, really fast as we divide by a fractional Z.
  • At Z = 0 (parallel to the camera) we get a divide-by-zero, which is exciting.
  • Behind the camera, the divisor is negative (Z is positive, but the W coordinate has been negated already to get the rest of rendering to be not insane), so everything is mirrored in X and Y!
So to review, we have INFs floating around our coordinates, some triangle vertices may have been mirrored around the origin, and some geometry might be scaled to be "freaking huge" if it's between the near clip plane and 0 in camera coordinates.

The disappearing models happen when we have a model-view-space AABB that is to the left side of the screen in front of the camera and to the right behind the camera (because it is at a 45 degree angle to the eye-space Z axis, for example); in this case the "behind us" part of the AABB flips, and the whole AABB is to the left side of the screen in screen space and gets culled.

(This case happens a lot when we have long, thin models that we are in the middle of, viewed at a 45 degree angle to their long axis.   In other words, walk around a modular city and the city is going to disappear on a regular basis as you pan the camera.)

The fix I found and implemented is stupid but effective: clip all transformed triangles in clip space before the perspective divide.
  • The AABB is treated as a cube and each of its 12 edges is clipped to produce a new point set.  Since a new AABB was going to be built around the box's corner points anyway, building one around the clipped point set isn't much different.
  • Triangles are clipped, sometimes resulting in zero or two triangles, before the perspective divide.
Besides fixing mouse testing, this clipping improves rendering speed in walk-through views significantly: geometry that is fully in front of the near clip plane is completely removed from consideration, even if its projection would be on-screen; similarly, an AABB that spans from off screen in front of us to directly behind us is clipped, and the tightened AABB no longer contains a 'false intersection' in front of us.
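Here's a minimal sketch of the edge-clipping step (hypothetical types; the triangle case follows the same pattern with extra output vertices).  In clip space the near plane is z = -w, so a vertex needs clipping when z + w < 0:

    struct Vec4 { double x, y, z, w; };

    static Vec4 lerp(const Vec4& a, const Vec4& b, double t)
    {
        Vec4 o = { a.x + (b.x - a.x)*t, a.y + (b.y - a.y)*t,
                   a.z + (b.z - a.z)*t, a.w + (b.w - a.w)*t };
        return o;
    }

    // Clip segment p0-p1 to the z >= -w half-space; false means fully clipped.
    bool clip_to_near(Vec4& p0, Vec4& p1)
    {
        double d0 = p0.z + p0.w;    // >= 0: on the visible side of the near plane
        double d1 = p1.z + p1.w;
        if (d0 < 0.0 && d1 < 0.0) return false;
        if (d0 < 0.0) p0 = lerp(p0, p1, d0 / (d0 - d1));
        if (d1 < 0.0) p1 = lerp(p1, p0, d1 / (d1 - d0));
        return true;
    }

Only after this clip is it safe to divide by w and build 2-d boxes.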

Other Options

There were a few other options I considered but haven't had time to code.

One option would be to do what X-Plane does (for culling) and work in eye-space, using side-of-plane tests for our frustum culling.  The test (for points) is cheap and would work without clipping.

(X-Plane does its ray-triangle mouse-testing in the coordinate system of the model itself; because all of the local transforms are simple rotations or translations, applying the inverse to the test ray is really cheap, and it avoids having to transform the entire model.)

Another option would be to work in homogeneous coordinates.  I am still not sure if this is mathematically possible, but the idea would be to do the ray-point intersection in clip space, and then check whether the clip-space intersection is behind the near clip plane.  This would let us do a single ray-triangle intersection without special-casing out the clipping logic, then reject a single point later.


Saturday, August 31, 2013

Theoretical Engineering: Occlusion Culling for BrickSmith

I have been pushing for a formal level-of-detail scheme for LDraw; the performance problem is that LDraw bricks contain a fairly high vertex count, so at low zoom a large model might be entirely on screen, resulting in a frame rate limited by pure vertex count.  (The vertices are on the GPU, the batches are drawn with array-attribute-divisor instancing, and the destination viewport is small with simple shaders, so the limiting factor is trying to cram 125M vertices through the GPU.)

During the discussion, some developers brought up the question of occlusion culling.  In many of these wide views, a fair amount of the geometry is going to have no contribution in the final image due to occlusion by other, closer bricks.

Occlusion culling is tricky stuff. Occlusion queries on the GPU are not available to the CPU until drawing completes; recent GPUs are finally gaining the ability to use the occlusion query result directly, eliminating back-to-the-CPU latency.  CPU-side occlusion queries are usually facilitated by heavy pre-processing of the source data, but that isn't an easy option in BrickSmith because it is an editor, not a viewer (and therefore the source data is continually changing).

What follows is speculative engineering - that is, how occlusion culling could work on a modern GPU.  Looking at the requirements for the design, we can then infer whether such a scheme is feasible.  (Spoiler alert: it is not.)

When looking at the algorithm that follows, please keep one thing in mind: the limiting factor here is vertex count - that's what we are trying to optimize.  This is a very unusual situation in 3-d graphics; most systems are designed to not have this problem. In particular, we are not fill rate bound, so we can expect GPU commands involving small numbers of vertices to complete reasonably quickly, because the GPU isn't drowning in last frame's pixel-fill operations.

Here's the basic idea:
  • Every part that we draw is divided into two sets of geometry: the crude occluders and the rest of the part.  The crude occluders are the non-transparent triangles and quads (no lines) that are big enough to be worth drawing early to start filling space.  The goal is to keep fill rate high while lowering vertex count, so that we can fill a lot of the screen quickly and then see what is left. (For example, for a 2x4 brick, the outer 5 quads of the brick top and sides would be the crude occluders; the entire interior, edging, studs and tubes and line work would be the rest.)
  • Each brick is pre-'normalized' so that it exists within a unit cube; when we draw our brick instances, the transform matrix will include scaling factors to reconstitute the brick, not just the translation and rotation we have now.  (This means that we can know from the instance data where a brick will be.)
  • We set up an off-screen render target so that we can get direct access to the depth buffer; we'll blit the whole thing down to screen later, and it'll be cheap (one single-pass screen-sized blit with no processing).
  • We start drawing the way the engine works now: we traverse the data model and dynamically build instancing lists for all opaque parts.  (Translucent parts get dumped in a holding bay for later, slower, sorted processing; they won't participate in occlusion.)  This step exists in the current BrickSmith source code and is fast enough that CPU is not the bottleneck even on my older 2008 Mac Pro.
  • We draw the instance list with the crude occluder meshes. This should, in theory, complete quickly, because the vertex count is low, early Z can take effect, there is no blending, and our shaders are not complicated.
So far what we have basically done is rapidly put down a pre-Z style pass with some of our geometry. If our only goal was to limit fill rate, this would be a pretty good setup.  But we need to eliminate actual drawing calls.

Our next step is to grab the depth buffer and use render-to-texture to ping-pong it, forming a hierarchy of depth buffers.  For each depth buffer we will use a "farthest" policy, so that a low-res depth texel contains the farthest depth of the four high-res depth texels that it covers.
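One reduction step of the ping-pong might look like this (illustrative GLSL with made-up names; it assumes the pyramid levels live in single-channel float color textures).  "Farthest" means taking the max of the four finer texels, since depth grows toward the far plane:

    static const char* kDepthReduceFrag =
        "uniform sampler2D u_prev_level;   // depth values of the finer level\n"
        "uniform vec2      u_texel;        // one texel step in the finer level\n"
        "varying vec2      v_uv;\n"
        "void main() {\n"
        "    float d0 = texture2D(u_prev_level, v_uv).r;\n"
        "    float d1 = texture2D(u_prev_level, v_uv + vec2(u_texel.x, 0.0)).r;\n"
        "    float d2 = texture2D(u_prev_level, v_uv + vec2(0.0, u_texel.y)).r;\n"
        "    float d3 = texture2D(u_prev_level, v_uv + u_texel).r;\n"
        "    gl_FragColor = vec4(max(max(d0, d1), max(d2, d3)));\n"
        "}\n";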

Now we can run GPU culling.  For each instance list VBO, we run it through a geometry shader, treating one instance as one vertex.  (In BrickSmith, instances are a 4x3 affine transform plus two RGBA colors, so our 20 floats can be five attributes in the vertex stream.)  The geometry shader calculates the screen space AABB of the instance, fetches depth from the appropriate level of the Z buffer, and determines whether the brick's bounding box is clearly behind every single pixel already drawn into the AABB's location.  The geometry shader then outputs zero or one vertices depending on the cull.
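For concreteness, the instance record described above might look like this (a hypothetical layout - the real BrickSmith struct may differ), with the five vec4 attributes bound in a loop.  Note that for the culling pass each instance is one "vertex," so no attribute divisor is involved:

    #include <OpenGL/gl3.h>   // hypothetical setup - use your GL header/loader

    struct BrickInstance
    {
        float transform[12];   // the 4x3 affine transform
        float color[4];        // primary RGBA
        float compl_color[4];  // complementary RGBA
    };                         // 20 floats = five 4-component attributes

    void bind_instance_attributes()
    {
        for (int a = 0; a < 5; ++a)
        {
            glEnableVertexAttribArray(a);
            glVertexAttribPointer(a, 4, GL_FLOAT, GL_FALSE,
                                  sizeof(BrickInstance),
                                  (const char*)0 + a * 4 * sizeof(float));
        }
    }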

The geometry shader must be fed into a transform feedback stage so that we can (1) capture the vertices back rather than render them and (2) so the GPU will retain the number of vertices for later use in a glDrawTransformFeedbackStreamInstanced call.

(We could implement frustum culling here too, but I think we need to keep the original CPU frustum cull from the initial draw in order to not draw the entire model off-screen when we are setting up our early-Z buffer.)

Note that BrickSmith currently uses one giant VBO for its instance list - the individual bricks know which part of the instance list they own, so we save on VBO count.  However, if we want to reduce the instance count we have a serious problem: our segment buffering will be borked.  One way to solve this is to transform each primitive into its own transform feedback VBO, so that we can know how many primitives to draw and where they start.  This would probably be bad for total draw performance.

(There might be a better way to organize this, perhaps with multi-draw-indirect or compute, or atomic writes to a memory buffer.  I have not dug into this because, as you will see from later in the post, this scheme is not actually shippable.)

Finally, with the instance lists culled, we can go and draw 'the rest' of the model; since the instance lists have been trimmed, we skip most of the geometry for every occluded brick.  Thus we will get a large vertex savings on every occluded brick.  (For example, a 2x4 brick will save better than 97% of its vertices, even after excluding lines, when occlusion culled.)

There are a few reasons why we can't actually ship this.
  • The biggest single problem that I see is that the part library would have to be pre-tagged into rough occluder geometry and the rest.  Given the size of the part library and its structure as a series of "subroutine"-style partial primitives, such a partitioning would both be a huge manual undertaking (that could in theory be automated) and it would result in a transformation of the parts that could not be fed back into the LDraw system.
  • The code above assumes the most modern GPU features.  Instanced transform feedback is an OpenGL 4.2 feature, so it would require a DX11-class GPU.  Apple has not yet shipped OpenGL 4.0, so at best we'd be dropping support for Lion and Mountain Lion.
  • Using separate VBOs for each brick (to get the cull list) would be a serious performance problem.  It might be possible to overcome this with other new OpenGL techniques, but the requirement on a new driver would remain. GPU-culling could also be performed only on parts for which there was a large part count, applying an 80-20 rule to culling.
  • Finally, the technique above is very complicated - it is probably more implementation work than the entire renderer rewrite was, and BrickSmith is not a commercial project; those of us who poke at it do so in our very, very limited spare time.  I don't think anyone involved in the project has time for such serious GPU work.
Edit: reading Daniel Rákos' article more carefully, I see he uses an asynchronous query to get the amount of feedback delivered.  In theory this would let us source our instances out of a single VBO even after culling, which would be a performance win.  However, I am not sure that this would be completely free. The algorithm without this modification can run entirely asynchronously - that is, the entire frame can be "late" and still render.  Pulling back the instance count to the CPU therefore might introduce the first sync point and hurt performance.  (E.g. if the previous frame hasn't completed, the next frame's query for the culling count will block because the GPU wasn't available.)

The conclusion is, of course, just to use crude reduced-LOD models for far viewing and go home happy.  But having spent the time considering what it would take to do GPU culling, I figured I should at least write it down.