The Hacks of Life: 01/01/2010

Sunday, January 31, 2010

To Strip or Not To Strip

In this post I will try to explain why a performance-focused OpenGL application like X-Plane does not use triangle strips. Since triangle strips were the best way to draw meshes a few years back, a new user searching for information might be confronted by a cacophony of tutorials advocating triangle strips and game developers saying "indexed triangles are better" without explaining why. Here's the math.

Please note: this article applies to OpenGL desktop applications, typically targeting NVidia and ATI GPUs. In the mobile/embedded space, it's a very different world, and certain GPUs (cough, cough, PowerVR, cough cough) have some fine print attached to them that might make you reconsider.

Why Triangle Strips Are Good

If you are drawing a bunch of connected triangles, the logic in favor of triangle strips is very simple: the number of vertices in the strip will be almost 66% fewer than the number you'd have if you simply made triangles. The longer the strip, the closer to that savings you get. Since geometry throughput is generally limited by total vertex count, this is a big win.
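The arithmetic behind that claim can be sketched in a few lines of C++. This is only an illustration; the function names are made up for this example. A strip of n triangles needs n + 2 vertices, while the same triangles drawn separately need 3n.

```cpp
#include <cassert>
#include <cstddef>

// Vertices submitted when n connected triangles are drawn as free triangles
// (GL_TRIANGLES without indexing): three per triangle, nothing shared.
std::size_t free_triangle_vertices(std::size_t n) { return 3 * n; }

// Vertices submitted when the same n triangles form one triangle strip:
// the first triangle costs three vertices, each later one costs one more.
std::size_t strip_vertices(std::size_t n) {
    assert(n >= 1);
    return n + 2;
}

// Fraction of vertices saved by the strip; it approaches 2/3 as n grows.
double strip_savings(std::size_t n) {
    return 1.0 - static_cast<double>(strip_vertices(n)) /
                 static_cast<double>(free_triangle_vertices(n));
}
```

For a 100-triangle strip that is 102 vertices instead of 300, a savings of 66%.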

Ten years ago, that's all you needed to know. Of course, making triangle strips is not so easy - some meshes simply won't form strips. The general idea was to make as many strips as you can, and draw the rest of your triangles as "free triangles" (e.g. GL_TRIANGLES, where each triangle is 3 vertices, and no vertices are shared).

(By the way, to see how to use the tri_stripper library to create triangle strips, look at the function DSFOptimizePrimitives in the X-Plane scenery tools code. Why DSFLib does this will have to be explained in another post, but suffice it to say, there is no hypocrisy here: X-Plane disassembles the triangle strips in the DSF into "free triangles" on load.)
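Disassembling a strip back into free triangles is mechanical. Here is a minimal sketch, assuming the strip is given as a list of 32-bit indices; the name strip_to_triangles is invented for this example and this is not X-Plane's actual loader code. The one subtlety is that every other triangle in a strip has reversed winding, so two of its indices get swapped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Expands one triangle strip's index list into a GL_TRIANGLES-style list.
// Odd-numbered triangles in a strip wind the other way, so we swap their
// first two indices to keep every output triangle facing the same direction.
std::vector<std::uint32_t> strip_to_triangles(const std::vector<std::uint32_t>& strip) {
    std::vector<std::uint32_t> out;
    if (strip.size() < 3) return out;              // too short to hold a triangle
    out.reserve((strip.size() - 2) * 3);
    for (std::size_t i = 0; i + 2 < strip.size(); ++i) {
        const std::uint32_t a = strip[i], b = strip[i + 1], c = strip[i + 2];
        if (a == b || b == c || a == c) continue;  // drop degenerate triangles
        if (i % 2 == 0) { out.push_back(a); out.push_back(b); out.push_back(c); }
        else            { out.push_back(b); out.push_back(a); out.push_back(c); }
    }
    assert(out.size() % 3 == 0);
    return out;
}
```

A strip of indices 0, 1, 2, 3 comes out as the two triangles (0, 1, 2) and (2, 1, 3).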

Indexing Is Better

In an indexed mesh, each vertex is stored only once, and the triangles are formed from a set of indices. (In OpenGL this is done by moving from glDrawArrays to glDrawElements.) With an index, you pay more (2 or 4 bytes) per vertex drawn, but you don't ever have to repeat the geometry of a vertex.

When is it worth it to index? It depends on the size of your indices and vertices, but it is almost always a win. For example, in an X-Plane object our vertices are 32 bytes (XYZ, normal, one UV map, all floating point) and our indices are 4 bytes (unsigned integer indices). Thus a vertex is 8x more expensive than an index. So if we can reduce 1/8th of the geometry via sharing, we will have a win.

Consider a simple 2-d grid: even with triangle strips, each strip (except those along the edges of the grid) is going to share a common edge with its neighbor. Thus if we use indexing, our 2-d mesh is going to have a savings that nearly approaches 2x for the geometry! That is way more than enough to pay for the cost of the indices.
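To put numbers on the grid case, here is a rough sketch that totals the bytes for three ways of drawing a grid of quads, using the 32-byte vertex and 4-byte index sizes quoted above. The struct and function names are assumptions made for this example, and the strip layout (one strip per row of quads) is just one plausible choice.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kVertexBytes = 32;  // XYZ + normal + UV as floats, per the post
constexpr std::size_t kIndexBytes  = 4;   // unsigned 32-bit indices, per the post

struct GridCost {
    std::size_t free_triangles;     // GL_TRIANGLES, no indexing
    std::size_t strips;             // one GL_TRIANGLE_STRIP per row, no indexing
    std::size_t indexed_triangles;  // shared vertices plus an index list
};

// Byte totals for a grid of cols x rows quads (two triangles per quad).
GridCost grid_cost(std::size_t cols, std::size_t rows) {
    assert(cols >= 1 && rows >= 1);
    const std::size_t triangles       = 2 * cols * rows;
    const std::size_t unique_vertices = (cols + 1) * (rows + 1);
    const std::size_t strip_vertices  = rows * 2 * (cols + 1);
    return {
        3 * triangles * kVertexBytes,
        strip_vertices * kVertexBytes,
        unique_vertices * kVertexBytes + 3 * triangles * kIndexBytes,
    };
}
```

For a 100 x 100 grid this gives 1,920,000 bytes as free triangles, 646,400 bytes as non-indexed strips, and 566,432 bytes as indexed triangles, so the indexed mesh is smaller than the strips even with the full 3-indices-per-triangle list.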

So the moral of the story is: any time your geometry has shared vertices, use indexing. Note that this won't always happen. If you have a mesh of GL_POINTS, you will have no sharing, so indexing is a waste. In X-Plane, our "trees" are all individual quads, no sharing, so we turn off indexing because we know the indexing will do no good.

But for most "meshed" art assets (e.g. anything someone built for you in a 3-d modeler) it is extremely likely that indexing will cut down the total amount of data you have to send to the GPU, and that is a good thing.

Triangle Strips Aren't That Cool When We Index

Now in the old school world, a triangle strip cut the amount of geometry down by almost 3x. Awesome! But in the indexed world, a triangle strip only cuts down the size of the index list by 3x. That is...not nearly as impressive. In fact, in X-Plane's case it is only 1/8th as impressive as it would have been for non-indexed geometry.

The take-away thing to observe: once we start indexing (which really makes geometry storage efficient) triangle strips aren't nearly as important as they used to be.

Restarting a Primitive Hurts

So far we've talked about ideal cases, where your triangle strips are very long, so we really approach a 3x savings. Here's the hitch: in real life triangle strips might be very short.

The problem with triangle strips is that we have to tell the card where the triangle strips begin and end, and that can get expensive. You might have to issue a separate glDrawElements call for each strip.

You don't want to make additional CPU calls into the GL to minimize the size of a buffer (the index buffer) that is already held in VRAM. CPU calls are much slower. And this is why X-Plane doesn't use strips internally: it's faster to be able to make only one draw call per mesh, even if it means a slightly bigger element list.

Now if you are a savvy OpenGL developer you are probably screaming one of two things:

What about glMultiDrawElements? I point you to here, and here. Basically both Apple and NVidia are suggesting that the multi-draw case may not be as ball-bustingly fast as it could be. There is always a risk that the driver decomposes your carefully consolidated strips into individual draw calls, and at that point you lose.
What about primitive restart? Well, it's NVidia only, so if you use it, you need to case your basic meshing code to handle its not being there. And even if it is there, you pay with an extra index per restart. If you have really good strips, this might be a win, but when the strips get small, you're starting to eat away at the benefit of shrinking down your indices in the first place. (The worst case is a triangle soup with no sharing, so you get no benefit from tri strips and you have to put a "restart" index into every 4th slot.)
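The restart overhead is easy to estimate. The sketch below counts index-list slots for a mesh described only by the length (in triangles) of each of its strips; the function names are hypothetical, and it assumes one restart index between consecutive strips and none after the last.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index slots needed to draw every strip's triangles as plain GL_TRIANGLES.
std::size_t triangle_list_indices(const std::vector<std::size_t>& strip_lengths) {
    std::size_t total = 0;
    for (std::size_t n : strip_lengths) total += 3 * n;
    return total;
}

// Index slots needed to draw the same strips in one call with primitive
// restart: n + 2 per strip, plus one restart index between neighbors.
std::size_t restart_strip_indices(const std::vector<std::size_t>& strip_lengths) {
    std::size_t total = 0;
    for (std::size_t n : strip_lengths) {
        assert(n >= 1);
        total += n + 2;
    }
    if (!strip_lengths.empty()) total += strip_lengths.size() - 1;
    return total;
}
```

With 1000 one-triangle "strips" the restart version needs 3999 slots against 3000 for plain triangles; strips of two triangles already come out ahead (2499 against 3000 for 500 strips), which is why the win depends so heavily on how well the mesh strips.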

And this brings me to one more concern: even if you do have some nice triangle strips in your mesh, you may have free triangles too, and in that case you're going to have to make two separate batch calls (GL_TRIANGLE_STRIP, GL_TRIANGLES) for the two "halves" of the mesh. So even if you are getting a triangle strip win, you're probably going to double the number of real draw calls (even with multi-draw) just to shrink an index list down.

Index Triangles

Thus the X-Plane solution: any time we have a mesh, we use indexed triangles and we go home happy.

We always draw every mesh in only one draw call.
We share vertices as much as possible.
We are in no way dependent on the driver handling multi-draw or having a restart extension.
We run at full speed even if the actual mesh doesn't turn to strips very well.
The code handles only one case.

As a final note, this post doesn't discuss cache coherency - that is, if you are going to present the driver with a "triangle soup", what is the best order? That will have to be another post, but for now understand that the point of this post is "indexed triangles are better than strips" - I am not saying "order doesn't matter" - cache coherency and vertex order can matter, no matter how you get the vertices into the GPU.

Friday, January 29, 2010

The Devil Is In the Details

I seem to have become horribly addicted to Stack Overflow. It makes sense, but I just feel a compulsion to answer other people's questions about OpenGL.

But there is one kind of question that drives me a little bit nutty...it goes like this:

I am new to OpenGL and I hope someone can help me. I am drawing a series of interlocking Möbius rings using GLU NURBS tessellators, GL_TEX_ENV_COMBINE, a custom separate alpha blending mode, the stencil buffer, and polygon offset.

For some reason one of my polygons are clipped. If I change the combine mode to add, the purple ones move to the left. If I change the polygon offset, the problem persists.

Any ideas?

My fellow OpenGL programmers: Stack Overflow is not a debugging service.

Stack Overflow is a great idea, and the site execution is really pretty good: automatic syntax formatting appropriate to code, tagging, search works pretty well. It is good for answering questions.

But a post like the above: it's not a question, it's a cry for help. (The answer, technically, is "yes", but I don't want to post that and get bad karma.) There are about a million things that could be going wrong from the fundamental design to the nuts and bolts.

In my experience, OpenGL bugs fall into three categories:

There is a one-off stupid mistake deep in the implementation that causes all hell to come down. Fixing the bug requires the usual techniques (divide and conquer and printf) until the bug is found and fixed. Stack Overflow is not the right tool - any programmer who is going to fix this needs to be able to modify and re-run the app repeatedly, and no one is going to do this for you for free anyway.
The overall algorithm design is wrong because of the design limits of the GL. Another programmer could at least tell you that you have this problem, but only if you know enough to ask the right questions. And if you know enough to ask, heck, you probably wouldn't have designed the code this way in the first place.
The GL implementation has a known bug. This is the one case where Stack Overflow can help, but the above question is not that. The programmer needs to have cut the problem all the way down to the one mysterious behavior (e.g. my color is showing up in one of my vertex attributes but the GL spec says this should not happen). In this case, at least having confirmation from other programmers that the bug is really in library code helps provide closure to the investigation.

My rant here is directed against case 1. If you need to post a long and detailed description of your code (as opposed to a question), you're not really asking a question, you're asking for someone to do your job for you.

Enough blogging, I'm going to go back to being grumpy now.

Thursday, January 28, 2010

I've Got the Blues

I have learned many things today - some of which you may already know. Did you know that in German, to be "blue" means to be drunk, not sad? Maybe I will have a blue Christmas next year!

I learned this while working with alpilotx on a quirky bug: experimental instancing code was causing the instanced geometry to look completely goofy and turning the rest of the scene pretty much completely blue. The bug appeared only on NVidia hardware on Linux.

Well if you read the fine print closely, you'll find this:

NVIDIA’s GLSL implementation therefore does not allow built-in vertex attributes to
collide with a generic vertex attributes that is assigned to a particular vertex attribute
index with glBindAttribLocation. For example, you should not use gl_Normal (a
built-in vertex attribute) and also use glBindAttribLocation to bind a generic vertex
attribute named “whatever” to vertex attribute index 2 because gl_Normal aliases to
index 2.

This is really too bad, as the GL 2.1 specification says:

There is no aliasing among generic attributes and conventional attributes. In
other words, an application can set all MAX_VERTEX_ATTRIBS generic attributes
and all conventional attributes without fear of one particular attribute overwriting
the value of another attribute.

I can report, with a full head of steam and outrage, that the current NVidia drivers on Linux definitely work the way NVidia says they do, and not the way the spec would like them to. Documenting what their code does...the nerve of it! Those NVidia driver writers!

What? You already knew this? Ha ha, so did I, just kidding, I was just quizzing you...

I don't think this is really news at all - I think I'm just really late to the party. In particular, X-Plane 8 and 9 run all of their shaders entirely using the built-in attributes to pass per-vertex information; sometimes that information is quite heavily bastardized to make it happen.

I'm sure there are reasons why this is evil, but I can tell you why we did it: it allows us to have a unified code path for scene graph, mesh, and buffer management. Only our shader setup code is actually sensitive to what the actual hardware is capable of doing - the rest runs on anything back to OpenGL 1.2.1.

(This is actually not 100% true. In some cases we will tag additional attributes to our vertices only on a machine with GLSL - this is a simple optimization during mesh build-up to save time and space for machines that will never use the extra attributes anyway. An example of this is the basis vectors for billboarding that are attached to trees: no GLSL means no billboarding the trees, so we drop the extra basis vectors.)

The moral of the story: let the linker pick your attribute indices.

Templating Functions

(This is a rehash of an answer I posted on Stack Overflow, after reading the previous posts and experimenting...probably bad form to repost here, but I want all my C++ drek in one place.)

Template parameters can be either parameterized by type (typename T) or by value (int X).

The "traditional" C++ way of templating a piece of code is to use a functor - that is, the code is in an object, and the object thus gives the code unique type.

When working with traditional functions, this technique doesn't work well, because a change in type doesn't indicate a specific function - rather it specifies only the signature of many possible functions. So:

template <typename OP>
int do_op(int a, int b, OP op) { return op(a,b); }

int add(int a, int b) { return a + b; }
...
int c = do_op(4,5,add);

Isn't equivalent to the functor case. In this example, do_op is instantiated for all function pointers whose signature is int X (int, int). The compiler would have to be pretty aggressive to fully inline this case. (I wouldn't rule it out though, as compiler optimization has gotten pretty advanced.)

One way to tell that this code doesn't quite do what we want is:

int (* func_ptr)(int, int) = add; int c = do_op(4,5,func_ptr);

is still legal, and clearly this is not getting inlined. To get full inlining, we need to template by value, so the function is fully available in the template.

typedef int(*binary_int_op)(int, int); // signature for all params
template <binary_int_op op>
int do_op(int a, int b) { return op(a,b); }

int add(int a, int b) { return a + b; }
...
int c = do_op<add>(4,5);

In this case, each instantiated version of do_op is instantiated with a specific function already available. Thus we expect the code for do_op to look a lot like "return a + b". (Lisp programmers, stop your smirking!)

We can also confirm that this is closer to what we want because this:

int (* func_ptr)(int,int) = add;
int c = do_op<func_ptr>(4,5);

will fail to compile. GCC says: "error: 'func_ptr' cannot appear in a constant-expression". In other words, I can't fully expand do_op because you haven't given me enough info at compile time to know what our op is.

So if the second example is really fully inlining our op, and the first is not, what good is the template? What is it doing? The answer is: type coercion. This riff on the first example will work:

template <typename OP>
int do_op(int a, int b, OP op) { return op(a,b); }

float fadd(float a, float b) { return a+b; }
...
int c = do_op(4,5,fadd);

That example will work! (I am not suggesting it is good C++ but...) What has happened is do_op has been templated around the signatures of the various functions, and each separate instantiation will write different type coercion code. So the instantiated code for do_op with fadd looks something like:

convert a and b from int to float.
call the function ptr op with float a and float b.
convert the result back to int and return it.

By comparison, our by-value case requires an exact match on the function arguments.
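Putting the two forms side by side may help. This is a compact sketch of the examples above gathered into one file; the two templates are renamed do_op_by_type and do_op_by_value here only so that both can coexist, and check_do_op is a throwaway helper added for the example.

```cpp
#include <cassert>

// Templated by type: OP is deduced from the argument, so any callable whose
// arguments and result convert to and from int is accepted.
template <typename OP>
int do_op_by_type(int a, int b, OP op) { return op(a, b); }

// Templated by value: the function itself is the template argument, so it
// must be known at compile time and must match this signature exactly.
typedef int (*binary_int_op)(int, int);
template <binary_int_op op>
int do_op_by_value(int a, int b) { return op(a, b); }

int   add(int a, int b)      { return a + b; }
float fadd(float a, float b) { return a + b; }

// Exercises each call form discussed in the text.
void check_do_op() {
    int (*func_ptr)(int, int) = add;
    assert(do_op_by_type(4, 5, add) == 9);       // by type, named function
    assert(do_op_by_type(4, 5, func_ptr) == 9);  // by type, runtime pointer: legal
    assert(do_op_by_type(4, 5, fadd) == 9);      // by type, int<->float coercion
    assert(do_op_by_value<add>(4, 5) == 9);      // by value, function baked in
    // do_op_by_value<func_ptr>(4, 5);           // does not compile: not a constant
    // do_op_by_value<fadd>(4, 5);               // does not compile: wrong signature
}
```

Whether a given compiler actually inlines either form is up to its optimizer; the by-value version just guarantees that the target function is known when the template is instantiated.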

Wednesday, January 27, 2010

Debugging GLSL

From a past post:

There are only two debugging techniques in the universe:

printf.
/* */

Is that true when writing GLSL shaders? Yep. Commenting out things is natively available. What about printf? The GLSL equivalent of printf is

gl_FragColor.rgba = vec4(stuff_i_want_to_see,...);

That is, you simply output an intermediate product to the final color, run your shader, then view something else. This is how I debug some of the more complex shaders: I view each product in series to confirm that my intermediate values aren't broken. Since the sim is running at 30 fps, I can move the camera and confirm that the values stay sane through a range of values.

The numeric output is often not in a visible range - to get around that I often use a mix of abs, fract (to see just the lowest bits), scaling, and normalize() to sanitize the output.

One app feature is critical: make sure you can reload your shaders in a heart-beat. In X-Plane we have a hidden menu command to do this. This way, you can move your printf, recompile the shaders, and see the change.

A visual debugger is a useful tool for debugging C/C++ because you don't have to commit to what you will view before compiling - you can just print any intermediate product from the debugger. For GLSL, make the recompile cycle fast, and you'll be able to simply edit the code in near-realtime.

A Tile Too Far

I've been playing with shading algorithms lately. One such algorithm is the "number puzzle". The basic idea is to take a repeating texture that is divided into sub-tiles and randomly move the tiles around. This is implemented in-shader by separating the UV coordinates and randomizing the bits that represent the "tile". (This is usually all but the lowest N bits.) The tile choice is made by sampling a random noise map, and the UV input to that comes from the upper bits so that it is stable (e.g. so we only switch tiles at the tile boundary).
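The UV manipulation can be emulated on the CPU to see what it does. The following is a rough C++ sketch, not the shader itself: it assumes the texture holds tiles_per_side x tiles_per_side sub-tiles, and it substitutes a small integer hash (hash_cell, an invented stand-in) for the random noise map. All names are made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct UV { float u, v; };

// Stand-in for sampling the noise map: a stable pseudo-random value per cell.
std::uint32_t hash_cell(std::int32_t cx, std::int32_t cy) {
    std::uint32_t h = static_cast<std::uint32_t>(cx) * 0x9E3779B1u
                    ^ static_cast<std::uint32_t>(cy) * 0x85EBCA6Bu;
    h ^= h >> 15; h *= 0x2C1B3C6Du; h ^= h >> 12;
    return h;
}

// Keeps the position inside the sub-tile but swaps which sub-tile is shown.
UV number_puzzle(UV in, int tiles_per_side) {
    assert(tiles_per_side >= 1);
    const float su = in.u * tiles_per_side, sv = in.v * tiles_per_side;
    const float cu = std::floor(su), cv = std::floor(sv);  // "upper bits": which cell
    const float fu = su - cu, fv = sv - cv;                // "lower bits": spot within the cell
    const std::uint32_t h = hash_cell(static_cast<std::int32_t>(cu),
                                      static_cast<std::int32_t>(cv));
    const std::uint32_t n = static_cast<std::uint32_t>(tiles_per_side);
    const float tu = static_cast<float>(h % n);            // random tile column
    const float tv = static_cast<float>((h / n) % n);      // random tile row
    return { (tu + fu) / tiles_per_side, (tv + fv) / tiles_per_side };
}
```

Because the hash input comes only from the cell coordinates, every point inside one cell lands in the same replacement tile, so the switch happens only at tile boundaries.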

One nice property of the number puzzle is that if you don't have shaders, you simply get a repeating texture. This is handy because the art assets and code don't have to be cased out for a fixed function case - we end up with uglier, but valid output.

It occurred to me today that the number puzzle can be atlased - that is, the random tile we pick could be constrained by the upper bits of the UV map, so that (by using a broad "space" of UV coordinates) we can pick from a set of tiles within a larger texture. This is a win because it means we can texture atlas and thus merge a bunch of differently tiled surfaces into one batch.

There is just one problem with this technique, one that might be a deal breaker as long as fixed function is necessary: when the shader is off, the atlasing gets ignored and we end up with junk. There really isn't a good way around this: wrapping and atlasing are, as far as I know, incompatible in the fixed function pipeline.

Thursday, January 14, 2010

Fast Paths

When looking at code speed, you can put on two different hats:

When designing an API, you might ask: how do we prevent a slow-down in the fastest possible path?
When implementing an API, you might ask: how does this affect overall performance?

They're not the same. Consider, for example, OpenGL state shadowing.

A well optimized OpenGL client program would not do this:

glEnable(GL_TEXTURE_2D);
glDrawArrays(GL_TRIANGLES, 0, 51);
glEnable(GL_TEXTURE_2D);
glDrawArrays(GL_TRIANGLES, 108, 51);

The second enable of texturing is totally unneeded. The clever programmer would optimize this away. But what does the OpenGL implementation do? We have two choices:

Check the texture enable state before doing a glEnable. In the case where the programmer didn't optimize, this saves an expensive texture state change, and in the case where the programmer did optimize, it is an unnecessary comparison, probably of one bit.
Do not check - always do the enable. In the case where the programmer didn't optimize, the program is slow; in the case where the programmer did optimize, we deliver the fastest path.
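The two choices can be mocked up without a real GL. In this sketch FakeDriver is a hypothetical stand-in that merely counts state changes, and the two wrapper classes are invented for illustration; no real driver is structured this simply.

```cpp
#include <cassert>

// Hypothetical stand-in for the expensive part of the driver.
struct FakeDriver {
    int state_changes = 0;
    void enable_texturing() { ++state_changes; }
};

// Choice 1: shadow the state and filter out redundant enables.
class ShadowingGL {
public:
    explicit ShadowingGL(FakeDriver& d) : driver_(d), texturing_(false) {}
    void enable_texturing() {
        if (texturing_) return;       // the cheap comparison
        texturing_ = true;
        driver_.enable_texturing();   // the expensive change
    }
private:
    FakeDriver& driver_;
    bool texturing_;
};

// Choice 2: never check, always forward to the driver.
class PassThroughGL {
public:
    explicit PassThroughGL(FakeDriver& d) : driver_(d) {}
    void enable_texturing() { driver_.enable_texturing(); }
private:
    FakeDriver& driver_;
};

// Replays the redundant double-enable through a wrapper and reports how
// many state changes actually reached the driver.
template <typename GL>
int changes_for_double_enable() {
    FakeDriver driver;
    GL gl(driver);
    gl.enable_texturing();
    gl.enable_texturing();            // redundant
    return driver.state_changes;
}
```

The shadowing wrapper lets one change through and the pass-through wrapper lets two; the price of the first is a comparison on every call, including the calls made by programs that never issue a redundant enable.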

In other words, it is a question of whether to optimize overall system performance in a world where programmers are sometimes stupid or lazy, or whether to make sure that those who write the fastest code get the fastest possible code.

(In a real program, detecting duplicate state change is very difficult, since code flow can be dynamic. For example, in X-Plane we draw only what is on screen. Since the model that was drawn just before your model will change with camera angle, the state of OpenGL just before we draw will vary a lot.)

From my perspective as a developer who tries to write really fast code, I don't care which one a library writer chooses, as long as the library clearly declares what is fast and what is not.

This was the motivation behind the "datarefs" APIs in the X-Plane SDK: a dataref is an opaque handle to a data source, and we have two sets of operations:

"Finding" the dataref, where the link is made from the permanent string identifier to the opaque handle. This operation is officially "slow" and client code is expected to take steps to avoid finding datarefs more than necessary, in performance critical locations, in loops, etc. (Secretly finding was linear time for a while and is now log time, so it was never that slow.)
Reading/writing the dataref, where data is transferred. This operation is officially "fast"; Sandy and I keep a close eye on how much code happens inside the dataref read/write path and forgo heavy validation. The motivation here is: we're not going to penalize well-written performance-critical plugins with validation on every write because other plugins are badly written. Instead the failure case is indeterminate behavior, including but not limited to program termination. (I'm not ruling out nasal demons either!)
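The find-once, read-many split can be sketched with a toy registry. Everything here is hypothetical: DataRegistry, data_handle, read_data, and write_data are invented for this example and are not the X-Plane SDK's API, and a plain pointer typedef stands in for the opaque handle.

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for an opaque handle; callers are expected not to look inside.
typedef float* data_handle;

class DataRegistry {
public:
    // Called by whoever owns the data to make it findable by name.
    void publish(const std::string& name, float* storage) { table_[name] = storage; }

    // The officially "slow" operation: a string lookup (log time in std::map).
    // Returns nullptr when the name is unknown.
    data_handle find(const std::string& name) const {
        std::map<std::string, float*>::const_iterator it = table_.find(name);
        return it == table_.end() ? nullptr : it->second;
    }
private:
    std::map<std::string, float*> table_;
};

// The officially "fast" operations: no validation whatsoever. Passing a bad
// handle is undefined behavior, which is the trade being described.
inline float read_data(data_handle h)           { return *h; }
inline void  write_data(data_handle h, float v) { *h = v; }
```

Client code would call find once at startup, keep the handle, and use only read_data and write_data inside its per-frame loop.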

This notion of "protecting the fast path" (that is, making sure the fastest possible code is as fast as possible) serves as a good guideline in understanding both C and C++ language design; in most cases, given a choice, C/C++ protect the fast path, rather than protecting, well, you.

A simple example: case statements. Case statements have this nasty behavior that they will "flow through" to the next statement if break is not included. 99% of the time, this is a programmer error, and it would be nice (most of the time) if the language disallowed it. But then we would lose this fast path:

switch(some_thingie) {
case MODE_A:
    do_some_stuff();
case MODE_B:
    do_shared_behavior();
}

In this case, where we want specialized behavior and then common behavior in mode A, but only the common behavior in mode B, flow-through lets us write ever so slightly more optimal code.
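A runnable version of the same idea, as a small sketch: the Mode enum and run_mode function are made up for this example, and a string log replaces the real work so the order of execution is visible.

```cpp
#include <cassert>
#include <string>

enum Mode { MODE_A, MODE_B };

// Records which pieces of behavior ran for a given mode.
std::string run_mode(Mode m) {
    std::string log;
    switch (m) {
    case MODE_A:
        log += "specialized;";   // only mode A does this part
        // fall through
    case MODE_B:
        log += "shared;";        // both modes do this part
        break;
    }
    return log;
}
```

Mode A produces "specialized;shared;" and mode B produces just "shared;", with the shared code written only once and no extra branch or function call in the source.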

If this seems totally silly now, in a world where optimizers regularly take our entire program, break it down into subatomic particles, and then reconstitute it as chicken nuggets, we have to remember that C was designed in the 70s on machines where the compiler barely could run on the machine due to memory constraints; if the programmer didn't write C to produce optimal code, there wasn't going to be any optimal code.

Tuesday, January 05, 2010

Coding For Two Audiences

I'm going to keep going with the "pithy one-liner thing", because obviously computer programming can be completely reduced to drinking coffee, cursing, and a few sentences you can write on the back of your hand. Okay, here goes:

All code is written for two audiences.

Ha! You could put that in a fortune cookie. Seriously though, that is the truth, and it is the driving motivation behind my style guidelines for headers.

The first audience for your code is, of course, the compiler. The compiler is a tool that writes your application - your code is a set of instructions to the compiler about what you want it to do. Since compilers aren't very creative (at least we hope) you have to be very precise, and the compiler tends to be very picky. A compiler gets all bent out of shape when you write things like:

viod set_flaps(int position);

No imagination, those compilers. They also aren't real good at catching things like this:

if (x=0) init_subsystem();

(Not quite fair - compilers now do catch some of the more knuckleheaded things you can do - but look at the C++-tagged posts in this blog for examples of what the compiler thinks isn't a bad idea.)

So lots of books have been written about how to write code that won't confuse the compiler and you'll find engineers who insist on writing if (0 == x) and such. That serves the first audience well. But what of the second audience?

The second audience is the humans who will have to read the code in the future in order to use or change it. That includes future you, so for your own sake, be nice to this audience. Code says something to people, not just to compilers. Consider this:

typedef void * model_3d_ref;
model_3d_ref load_model_from_disk(const char * absolute_file_path);
void draw_model(model_3d_ref the_model, float where_x, float where_y, float where_z);
void deallocate_model(model_3d_ref kill_this);

Without knowing what the hell we're doing, if you know C and have worked as a computer programmer a few years, you probably already have a rough idea of what I'm trying to do with those declarations. Humans read code, and humans infer things from the code that will be necessary to work on it.

The compiler doesn't read your code like this - the following code is exactly the same to a compiler:

void * load_model_from_disk(const char *);
void draw_model(void *, float, float, float);
void deallocate_model(void *);

As humans though, the above is a lot more like gibberish.

Header Nazi

And that is why I am a header Nazi. Here's how I do the math: if you write code that is useful, bug free, and reasonably well encapsulated/insulated, then people are going to spend a lot more time looking at the header to understand the interface than they will spend looking at the implementation. (In fact, it should be unnecessary to look at the implementation at all to use the code.)

For this reason, I want my headers to be clean, clean, clean. I want them to read like a book, because that's what they are: the user's manual for this module to the humans who will use it. This tick comes out in a few forms:

I prefer physical insulation (putting code in the cpp file) to logical encapsulation (putting things in the private: part of an object) because it gets the implementation details out of sight. It keeps the human readers from being distracted by how the module works, and helps keep inexperienced programmers from mistaking implementation for interface.
If I have to inline for performance, I keep the inline out-of-class at the bottom of the header so it doesn't detract from readability.
Bulk comments about usage go in the header to form a document.
Any semantics about calling conventions go in the header so that examining source is not necessary.

Saturday, January 02, 2010

When To Rewrite

If one thing drives me crazy, it is reading claims in the flight simulator community that FS X needs "a total rewrite". Now FS X is our (now EOLed, at least temporarily) competition, but people have made the same claim about X-Plane, and it is just as stupid for FS X now as it was for X-Plane then. The users who claim a rewrite is needed are quite possibly not software engineers and certainly don't have access to a proprietary closed source code base, which is to say, they are completely unqualified to make such a claim. But "let's do a total rewrite" does persist as a real strategy in the computer industry - I have been on teams that have tried this, and I can say with some confidence: it is a terrible idea. To claim that 100% of the software should be thrown out is to fail to understand how software companies make money.

Joel's treatment on the subject is thorough and clear. I would only add that beyond the intensely poor return on investment of a total rewrite (e.g. spending developer time to replace field tested, proven code that users like with a track record for making money with new untested code that may be buggy without adding new features), the actual dynamics of a rewrite are even worse in practice. This is how I would describe the prototypical rewrite:

Software product X is first developed by a small team of grade A programmers - programmers who understand what they are doing completely, can ship product, fully chase down bugs, and understand the trade-offs of architecture vs. ship date. These programmers maybe don't always write the cleanest code, but when they write something dirty, they know why it's dirty, what they will do about it, and at what point it will make sense from a business standpoint to fix it. (And the fact that the "dirty" code shipped means: that time to fix the problem hasn't come yet.)

Once the product starts making money, the team grows, and the product goes into a feature mode - new versions get new features added into the code. The business model is to sell upgrades by putting features into the code on a timely basis. This is where things start to get tricky:

The business model rewards shipping new features. Thus the metric that the company should be looking at is "efficiency", e.g. how many man-months to get a feature valued at some number of dollars?
There is an opportunity cost to not shipping features, thus the team has been increased in size with "grade B" developers.
Now management has a serious problem: if the efficiency of the team is declining, is it because the grade B developers aren't as efficient (a known and acceptable risk) or because the code is becoming harder to work with?

Every feature is different, and it's likely that the original "A" team is working on the hardest features - the ones only they can do. So isolating and detecting that your code base is becoming fugly is going to be nearly impossible for management. If you have management by metrics (e.g. a management team that uses proxy metrics like bug count, KLOC and other such things but doesn't actually look at what the code says) they are not going to have any tools to recognize the problem. Combine that with the fact that every developer says every piece of code not written by himself/herself within the last 3 days is fugly, and management just doesn't know the extent of the problem.

Is the code base getting worse at this point? Almost certainly yes!

If the original design was business-optimal, it did not contain a bunch of code to make future expansion easy. (Side note: this is the right decision and this problem of architectural drift should not be solved by making the "grand design" in version 1. No one knows what features will actually be useful in version 2, so a "grand designed" version 1 is going to have a ton of crap that will never get productized and just take longer to ship in the first place.)
If the business model can't track efficiency and code quality, then the A team (the only ones capable of rearchitecting the design) are under strong pressure not to do so. In fact, they're getting the hardest problems and are probably critical path in every release; asking them to rearchitect too will seem like an impossibility.
The B team doesn't understand the design, and thus every feature they're putting in is probably screwing up the program a little bit more.

Now at some point the team will collectively notice that it has become really hard to actually ship anything. So many features have shipped on top of an architecture not meant to handle them that every new feature introduces bugs, side effects, unintended consequences, and developers are now spending most of their time trying to understand what the existing code does, rather than adding new things. This brings us to the third phase: "let's do a rewrite". Inevitably someone will get the idea that the entire code base should be thrown out and reworked, bringing clean code, the next big thing, world peace, etc.

Management is eventually convinced that this is a good idea, but can't accept the idea of not having revenue. So the team is split in half:

Half the team does maintenance updates on the existing code, to ship the next version, with new features, business as usual. This team will probably be in a bad mood, as they have to work in a pit of slime.
The other half of the team is split off to build the next-generation system, to ship one release after this one, on all new code.

The problem is: the next-generation approach will fail. Here's how:

The next-generation approach will start with an architecture that is too grand for its own good. Without the pressure to ship a 1.0 product, with the mandate not to ship product but to "clean the system up", and after years of dealing with crappy code, the next-gen design will be brilliant, but severely over-architected from the start. To expect the engineers in this situation to really be good about minimalism in architecture is to expect monkeys to fly. (If there are any technological fads going around, expect the new design to pick them up like a fleece attracts dog fur.)
When the marketing team realizes there is a "next generation" scheme, they will promptly hang every ridiculous feature they have ever thought of on the scheme. As long as you're rewriting the architecture, why don't you make it so that the entire system can be remotely accessed from your car radio? Can we make the user interface fully customizable so it can be skinned in baby blue or pink? We would like it to use this series of TLAs that Microsoft thinks are clever right now.
What won't go into the initial design is all of the small features (and bugs/design flaws that users think are features) that make the existing product a hit. So while we already have a hugely over-scoped product, it's going to pick up another set of features, the ones that really make money, late in the design. Of course, it was never architected for those features (hell, the designers probably thought the old code was "dirty" because of those features) and thus if this product ever ships, the code will be considered fugly as soon as it is finished.

Fortunately that will never happen. The end result of this tale is most likely not a new version shipping that's even worse than the old - it is the new version not shipping at all. Under any kind of pressure, management will move resources from the next-gen rewrite to maintenance of the existing version. If you're lucky, someone on the team will incrementally harvest pieces of the next-gen code and paste them into the existing code on an as-needed basis, making this entire effort the least efficient attempt at incremental refactoring you can possibly imagine.

I'm not entirely sure what the practical solution to this is. Since I now work at a company with only two full-time developers (myself included) we have the advantage of there simply not being a B team to hose our design. It makes me wonder what management should have done. Possibilities include:

Hire only "A team" programmers and accept that the lost short-term revenue and higher cost of hiring only really top-notch employees are offset by the long term ROI of code that can efficiently generate features over a longer span. (It may be that this isn't a win, and that the above parable, while depressing, is the best ROI...products don't make money forever, and the financially lucrative thing to do might be to ship 1.0, add features until it's dead, then kill it.)
Accept the cost of refactoring during work. This requires a really special type of programmer - you need someone who fully understands the architecture and the business model and can balance the two. This business model implies that the team trusts the absolute judgment of the very small (maybe only one) group of individuals who see both sides.
Keep the B team on a really, really, really short leash. This probably reduces their efficiency even more (to the point where they may not be useful ROI-wise) and probably drives those developers crazy.

So When Do You Rewrite

Implicit in all of this is a very basic idea: if the software is incrementally rearchitected during development, the long term return on investment is going to be a lot better. You're going to make more money (in the long term) because:

You're rearchitecting lots of smaller, easier to fix problems instead of the mother of all train-wrecks. The cost of working on code is non-linear and a function of interdependency, so it's almost always going to be a huge win to nip a problem in the bud.
The cost of adding features will stay low, which means the cost of your team isn't going to go up relative to output over time.

(This assumes that the value of a feature is constant through time - if there is a market window you have to hit in time, all bets are off.)

So when is the right time to do this rearchitecting? At this point in my programming career, my answer is:

Right before you would lose any productivity by not rearchitecting.

(Note that this is after "right after you are completely sure that you will need to rearchitect.")

Basically you never want to rearchitect until you are 100% sure that it is necessary. Rearchitecting early risks wasted work. (I would say the cost is worse - you pay for the complexity of your software continuously so any unused abstractions are hurting your business.) Once you know that the feature you are doing is being made difficult by an existing inadequate architecture, you have 100% certainty.

Since the long term costs of rearchitecting are going to be less if you do it earlier, refactor as soon as it's holding you back. Start every feature with a behavior-neutral rewrite of the underlying code to be exactly how it needs to be - the actual feature work will be so much more efficient, and the resulting code will look like you could keep working on it without sticking needles in your eyes.
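
As a rough illustration of what a "behavior-neutral rewrite" can look like before the feature work starts, here is a minimal Python sketch. Everything in it is hypothetical and not taken from any real code base: the shipping-cost rules, the function names legacy_shipping_cost and shipping_cost, and the helper find_mismatches are all assumptions made up for the example. The idea is just that the old and new versions are run side by side on the same inputs, and the rewrite only counts as behavior-neutral if nothing differs.

```python
from itertools import product


def legacy_shipping_cost(weight_kg, express, member):
    # Hypothetical "before" code: each past feature was bolted on as
    # another branch, so every rule is repeated in several places.
    if express:
        if member:
            cost = 5.0 + weight_kg * 2.0 * 0.9
        else:
            cost = 5.0 + weight_kg * 2.0
    else:
        if member:
            cost = weight_kg * 1.0 * 0.9
        else:
            cost = weight_kg * 1.0
    return round(cost, 2)


def shipping_cost(weight_kg, express, member):
    # Hypothetical "after" code: same results, but each rule now lives
    # in exactly one place, so the next feature has one spot to change.
    base_fee = 5.0 if express else 0.0
    rate_per_kg = 2.0 if express else 1.0
    member_discount = 0.9 if member else 1.0
    return round(base_fee + weight_kg * rate_per_kg * member_discount, 2)


def find_mismatches(old, new, cases):
    # Return every input on which the two versions disagree.
    return [args for args in cases if old(*args) != new(*args)]


# A small grid of inputs covering every branch of the legacy code.
CASES = list(product([0.0, 0.5, 1.0, 12.3, 100.0], [False, True], [False, True]))

if __name__ == "__main__":
    mismatches = find_mismatches(legacy_shipping_cost, shipping_cost, CASES)
    print("inputs where behavior changed:", mismatches)
```

Only once a check like this comes back empty would the actual feature be added, on top of the cleaned-up version. The check is only as good as the inputs it covers, so it is a sanity net, not a proof.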

(But: don't start rearchitecting a module that is going to remain unmodified. You'll have to retest the module even though you didn't gain revenue. If you rearchitect now for the feature you will code in six months, you increase the amount of testing the current release is going to need. So make a note "this code will need work", and in six months, that's when you dig in.)