Friday, October 29, 2021

This One Weird Trick Lets You Beat the Experts

As I've mentioned in the past, one of my goals with this blog is to launch my post-development career in tech punditry and eventually author an O'Reilly book with an animal on the cover.* So today I'm going to add click-baity titles to my previous pithy quotes and tell you about that one weird trick that will let you beat the experts and write faster code than the very best libraries and system implementations.

But first, a Vox-style explanatory sub-heading.

StackOverflow is a Website That Destroys New Programmers' Hopes and Dreams

Here is a typical question you might see on StackOverflow**:

I'm new to programming, but I have written a ten line program. How can I write my own custom allocator for my new game engine? I need it to be really fast and also thread safe.

StackOverflow is a community full of helpful veteran programmers who are there for the newbies among us, so someone writes an answer like this.***
Stop. Just stop. Unplug your keyboard and throw it into the nearest dumpster - doing so will help your game engine. The very fact that you asked this question shows you should not be trying to do what you are trying to do.

You cannot beat the performance of the system allocator. It was written by a Linux programmer with an extraordinarily long beard. That programmer spent over 130 years studying allocation patterns. To write the allocator, he took a ten year vow of silence and wrote the allocator in a Tibetan monastery using the sand on the floor as his IDE. The resulting allocator is hand-optimized for x86, ARM, PDP-11 macro-assembler, and the Zilog Z-80. It uses non-deterministic finite state automata, phase-induced reality distortion fields, atomic linked lists made from enriched Thorium, and Heisenberg's uncertainty principle. It is capable of performing almost 35 allocations per second. You are not going to do better.
Discouraged, but significantly wiser, our young programmer realizes that his true calling in life is to simply glue together code other people have written, and vows to only develop Electron apps from this point forward.

But what if I told you there was this one weird trick that would allow our newbie programmer to beat the system allocator's performance?

The Secret of My Success

Come closer and I will whisper to you the one word that has changed my life. This one thing will change how you code forever.  Here's what our young programmer needs to do to beat the system allocator:

Cheat.

Here, I'll show you how it's done.
#include <stddef.h>  /* for size_t */

static char s_buffer[1024];
void * cheat_malloc(size_t bytes) { return s_buffer; }  /* everyone gets the same 1K block */
void cheat_free(void * block) { }                       /* free is a no-op */
(Listing 1 - the world's worst allocator.)

One thing you cannot deny: this code is a lot faster than malloc.
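Don't take my word for it - here's a quick and dirty micro-benchmark sketch. (The harness is mine, and the iteration count and block size are arbitrary; adjust to taste.)

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void * cheat_malloc(size_t bytes);   /* from Listing 1 */
void   cheat_free(void * block);

int main(void)
{
    enum { N = 10000000 };           /* ten million alloc/free round trips */
    void * volatile sink;            /* volatile so the optimizer can't delete the loops */

    clock_t t0 = clock();
    for (int i = 0; i < N; ++i) { sink = cheat_malloc(64); cheat_free(sink); }
    double cheat_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

    t0 = clock();
    for (int i = 0; i < N; ++i) { sink = malloc(64); free(sink); }
    double malloc_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("cheat_malloc: %.3fs   malloc: %.3fs\n", cheat_s, malloc_s);
    return 0;
}
(Listing 2 - an extremely unscientific benchmark.)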

Now you might be saying: "Ben, that is literally the worst allocator I have ever seen" (to which I say "Hold my beer") but you have to admit: it's not that bad if you don't need all of the other stuff malloc does (like getting blocks larger than 1K or being able to call it more than once or actually freeing memory). And really fast. Did I mention it was fast?

And here we get to the pithy quotes in this otherwise very, very silly post:

You might not need the full generality and feature set of a standard solution.
and
You can write a faster implementation than the experts if you don't solve the fully general problem.
In other words, the experts are playing a hard game - you win by cheating and playing tic tac toe while they're playing chess.

All That Stuff You Don't Need

A standard heap allocator like malloc does so much stuff. Just look at this requirements list.
  • You can allocate any size block you want. Big blocks! Small blocks! Larger than a VM page. Smaller than a cache line. Odd sized blocks, why not.
  • You can allocate and deallocate your blocks in any order you want. Tie your allocation pattern to a random number generator, it's fine, malloc's got your back.
  • You can free and allocate blocks from multiple threads concurrently, and you can release blocks on different threads than you allocated them from. Totally general concurrency, there are no bad uses of this API.
The reason I'm picking on these requirements is because they make the implementation of malloc complicated and slow.****

One of the most common ways to beat malloc that every game programmer knows is a bump allocator. (I've heard this called a Swedish allocator, and it probably has other names too.) The gag is pretty simple:
  1. You get a big block of memory and keep it around forever. You start with a pointer to the beginning.
  2. When someone wants memory, you move the pointer forward by the amount of memory they want and return the old pointer.
  3. There is no free operation.
  4. At the end of a frame, you reset the pointer to the beginning and recycle the entire block.
  5. If you want this on multiple threads, you make one of these per thread.
This is fast! Allocation is a single add, freeing is zero operations, and there are no locks. In fact, cache locality isn't bad either - successive allocations will be extremely close together in memory.
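In code, a minimal single-threaded sketch looks something like this (the names are mine, and I'm assuming 16-byte alignment is enough for whatever you're allocating):

#include <stddef.h>

#define ARENA_SIZE (16u * 1024u * 1024u)   /* one big 16 MB block, sized to taste */

static _Alignas(16) char s_arena[ARENA_SIZE];
static size_t            s_used = 0;

void * bump_alloc(size_t bytes)
{
    bytes = (bytes + 15) & ~(size_t)15;    /* round up so every block stays 16-byte aligned */
    if (s_used + bytes > ARENA_SIZE)
        return NULL;                       /* out of arena - grow it, or allocate less per frame */
    void * ret = s_arena + s_used;
    s_used += bytes;
    return ret;
}

void bump_reset(void)                      /* call once at end of frame - "frees" everything */
{
    s_used = 0;
}
(Listing 3 - a bump allocator.)

For the per-thread version in step 5, mark s_arena and s_used as _Thread_local and you're done - still no locks.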

Our bump allocator is only slightly more complex than the world's worst allocator, but it shares a lot in common: it's faster because it doesn't provide all of those general purpose features that the heap allocator implements. If the bump allocator is too special purpose, you're out of luck, but if it fits your design needs, it's a win.

And it's simple enough you can write it yourself.

General Implementations are General

You see the same thing with networking. The conventional wisdom is: "use TCP, don't reinvent TCP", but when you look into specific domains where performance matters (e.g. games, media streaming) you find protocols that do roll their own, specifically because TCP comes with a bunch of overhead to provide its abstraction ("the wire never drops data"), overhead that is expensive and not needed for the problem space.

So let me close with when to cheat and roll your own. It makes sense to write your own implementation of something when:
  • You need better performance than a standard implementation will get you and
  • Your abstraction requirements are simpler or more peculiar than the fully general case and
  • You can use those different/limited requirements to write a faster implementation.
You might need a faster implementation of the fully general case - that's a different blog post, one for which you'll need a very long beard.



* The animal will be some kind of monkey flinging poop, obviously.

** Not an actual StackOverflow question.

*** Not an actual StackOverflow answer. You can tell it's not realistic because the first answer to any SO question is always a suggestion to use a different language.

**** I mean, not that slow - today's allocators are pretty good - and certainly better than I can write given those requirements. But compared to the kinds of allocators that don't solve those problems, malloc is slower.

Saturday, October 16, 2021

Is It Ever Okay to Future Proof?

A while ago I tried to make a case for solving domain-specific problems, rather than general ones. My basic view is that engineering is a limited resource, so providing an overly general solution takes away from making the useful specific one more awesome (or quicker/cheaper to develop).

Some people have brought up YAGNI, with opinions varying from "yeah, YAGNI is spot on, stop building things that will never get used" to "YAGNI is being abused as an excuse to just duct tape old code without ever paying off technical debt."

YAGNI is sometimes orthogonal to generalization. "Solving specific problems" can mean solving two smaller problems separately rather than trying to come up with one fused super-solution. In this case, there's no speculative engineering or future proofing, it's just a question of whether the fused whole can cost less than the sum of its parts. (I think the answer is: Sometimes! It's worth examining as part of the design process.)

But sometimes YAGNI is a factor, e.g. the impulse to solve a general problem comes from an assumption that this general design will cover future needs too. Is that ever a good thing to assume?

Nostradamus the Project Manager

So: is it ever okay to future proof? Is it ever okay to have design requirements for a current engineering problem to help out with the next engineering problem?  Here are three tests to apply - if any of them fail, don't future proof.

  • Do you know for a fact that the future feature is actually needed by the business? If not, put your head down and code what needs to be done today.
  • Do you know exactly how you would solve the future feature efficiently with today's code?  If not, back away slowly.
  • Would the total engineering cost of both features be, in sum, smaller if you do some future proofing now?  If not, there's no win here.
Bear in mind, even if you pass all of these three tests, future proofing might still be a dumb idea. If you are going to grow the scope of the feature at all by future proofing, you had best check with your team about whether this makes sense. Maybe time is critical or there's a hard ship date. Keep it tight and don't slip schedule.

I find the three tests useful to cut off branches of exploration that aren't going to be useful. If my coworker is worrying that the design is too specific and can't stop playing "what if", these tests are good, because they help clarify how much we know about the future. If the answer is "not a ton", then waiting is the right choice.

And I think it's worth emphasizing why this works. When solving new coding problems, one of the ways we learn about the problem space and its requirements is by writing code. (We can do other things like build prototypes, run simulations, create test data, etc., but writing code absolutely is a valid way to understand a problem.)

The act of writing feature A doesn't just ship feature A - it's the R&D process by which you learn enough to write feature B. So fusing A & B's design may not be possible, because you have to code A to learn about B. These tests highlight when features have this kind of "learning" data dependency.

Sometimes Ya Are Gonna Need It

With that in mind, we do sometimes future proof code in X-Plane. This almost always happens when we are designing an entire subsystem, we know its complete feature set for the market, but there is an advantage to shipping a subset of the features first. When I wrote X-Plane's particle system, we predicted (correctly) that people would want to use the effects on scenery and aircraft, but we shipped aircraft first with a design that wouldn't need rewrites to work with scenery.

Given how strong the business case is for scenery effects, and given that we basically already know how this code would be implemented on its own (almost exactly the same as the aircraft code, but over here), it's very cheap to craft the aircraft code with this in mind.

But it requires a lot of really solid information about the future to make this win.