Friday, April 27, 2007

Getting Started with WinDbg

I discovered WinDbg about a year ago and have found it to be the most powerful and complete debugger available for Windows PCs. Unfortunately, it can be a very intimidating debugger, since most of the work is done via text commands. I'm going to use this blog entry as a place to collect most of what I've learned about WinDbg. It will be updated quite often until I feel I've documented enough to use this as a reference to teach someone how to use it.

Setting WinDbg as the default Just-In-Time (JIT) Debugger


From the command line, run:
C:\> cd C:\Program Files\Debugging Tools for Windows
C:\Program Files\Debugging Tools for Windows> WinDbg -I

Adding Microsoft Symbol Server to the Symbols Path

Launch WinDbg, then go to File>Symbol File Path and add SRV*C:\Windows\Symbols*http://msdl.microsoft.com/download/symbols to the paths.

When you exit WinDbg, it will ask you if you wish to save changes for Workspace Base. You must select YES or it will not save your symbol path.
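
If you'd rather not rely on the workspace, the same path can also go in the _NT_SYMBOL_PATH environment variable, which WinDbg (and the rest of the Debugging Tools) picks up automatically. A minimal sketch from the command line (this only lasts for the current console session; set it in the System control panel to make it permanent):

C:\> set _NT_SYMBOL_PATH=SRV*C:\Windows\Symbols*http://msdl.microsoft.com/download/symbols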

Monday, April 16, 2007

Musings on Optimization

First, I should say that the point of my post on the cost of virtual methods in C++ is absolutely not that virtual methods are too expensive. If I had to make a general rule, I would have to go the other way and say they're rather affordable.

I think the most important thing to understand is that, because time-to-market is valuable and employees' salaries are expensive, development time is a resource that must be deployed efficiently. Thus if you are going to spend time optimizing, it's important that you deliver real value for your work. This means two things, I suppose:
  • Don't optimize code that's already fast enough.
  • Don't optimize code that is so irrelevant to your program's performance that it doesn't translate into improved user experience.
For example, by X-Plane efficiency standards, the date-time dialog box is not nearly as fast as it could be. We could do a lot to make it faster! But have we? Heck no! We have a wide variety of tricks and tools to make graphics code faster, and each one of them takes developer time -- deploying them on a dialog box where no one cares about speed would be folly. No one cares!

Similarly, we could try to optimize the performance of our menu-bar drawing code. The menu bar is drawn often enough that it affects users. But the menu bar is such a vanishingly small fraction of program performance that even if we made it 100x faster, the improvement in overall program speed would be unmeasurably small.

So in that context, I can only say this about virtual functions: they are an organizational tool that reduces the total time it takes to write code by giving the programmer better ways to manage and structure it. Since the performance of a program is already limited by how many developer hours are available to optimize:
  • Most of the time, the productivity gain from virtual functions will outweigh their performance cost.
  • I suspect that if you are really worried about speed, you'll get better program performance by using "more expensive" OOP techniques in most parts of the program and spending the freed-up development time optimizing the few parts that really matter.
This kind of performance work is called opportunistic optimization - the idea is that you measure your program's performance and spend your time on only the parts that are both optimizable and taking up a lot of time, thus maximizing the performance boost per developer hour. With tools like VTune and Shark, getting easy-to-read, high-quality data on where optimization is needed is now really easy.

(An instrumenting profiler changes your code to record timing data; an adaptive sampling profiler uses the hardware to sample the program at fixed intervals. While instrumenting profilers would seem to provide more accurate data, I don't like them, because the profiling overhead often changes performance characteristics so much that you can't tell what the real problem is. This is particularly a problem for OpenGL-based applications, where half the work happens on the GPU, and thus the CPU-GPU execution speed ratio matters a lot.

Adaptive sampling profilers work because the most important code to fix is, by definition, running most of the time, so most of the samples will fall into functions you care about. For example, when profiling X-Plane we can find up to 40% of samples falling into OpenGL, mesh, and OBJ draw code, which is indeed the most important part of X-Plane to tune. Sure, the adaptive sampling profiler gives us really unrealistic data about the cost of drawing the menu bar, but the time spent there is so small that we don't care.

Shark is strictly an adaptive sampling profiler. VTune comes with both types. GCC and most compiler packages also provide an instrumenting-profiler option.)
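
For what it's worth, the classic GCC recipe looks like this (a sketch - the file name is made up, but the flags and tools are the standard ones):

$ g++ -pg mytest.cpp -o mytest    # -pg adds the profiling hooks
$ ./mytest                        # running the program writes gmon.out
$ gprof mytest gmon.out           # prints the flat profile and call graph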

One more thought: X-Plane is fast by design -- that is, some of the things we did to have high throughput from the rendering engine were design decisions that had to be made early on. This goes against the idea of opportunistic optimization. If we were planning these ideas from day one, how did we know from running the program that they would matter? Did we just guess?

The answer is: we knew where to optimize in X-Plane 8 from the performance of X-Plane 7. That is, we use the design-induced performance limitations of a given revision to direct our development into the next version. This is just opportunistic optimization combined with code refactoring.

Tuesday, April 10, 2007

C++ Objects Part 8: The Cost of Virtual Methods

In my last post I covered dynamic casts and ambiguous base classes. This will be my last post on C++ object memory layout: the cost of using objects.

In truth I will only look at the cost of method calls - I will operate under the assumption that using a C++ object is like using a struct from a data-access standpoint, which is mostly true; you can see from past posts that virtual base classes represent containment by pointer while non-virtual base classes are contained by, well, containment, so you can imagine the costs involved.

Before getting into this, one overarching note: in my experience working on X-Plane, effective function inlining is much more important to the rapid execution of performance-critical C++ code than any of the overheads here. In particular, all of these examples assume that inlining has not occurred. The loss of inlining is a much larger cost than the additional pointer references, glue calls, etc. that we'll see.

The compiler can only inline a function when it knows exactly what will be called. Any time we have polymorphism, that goes away, because runtime data determines the call target. These examples are done with inlining and optimization off (mostly to keep the assembly code sane), but basically no call that requires a memory access to fetch an address before the function call can be inlined.
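
As a tiny illustration (a sketch - a sufficiently clever optimizer can sometimes prove the target of an indirect call, but in general it cannot):

static int twice(int x) { return 2 * x; }

// Target known at compile time: the compiler is free to inline twice().
int call_direct(int x) { return twice(x); }

// Target is runtime data: the compiler must fetch the address and make
// a real call, so no inlining.
int call_indirect(int (*fn)(int), int x) { return fn(x); }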

Test Cases

I wrote a stub of code to look at free functions and function pointers (to look at the cost of call-via-memory), virtual functions, and virtual base classes. In all cases I wrote a stupid stub program and let CodeWarrior disassemble the results.

Here's the C++:


static void an_func() { }

class foo {
public:
void non_virt() { }
virtual void virt() { }
};

class bar : public virtual foo { };


I have eliminated any arguments and return types, which avoids a lot of stack setup code that would otherwise make the examples harder to read in assembly. (We can't all be Larry Osterman. I agree with him 100%, but I suspect many programmers don't speak x86 as their first language.)

Here's the test environment:

foo f;
bar b;
void (* anfunc_f)() = an_func;
foo * ff = &f;
bar * bb = &b;

The optimizer is off - in real production code the optimizer may be able to share common setup costs for function calls. This exercise should demonstrate the total number of memory accesses required from a "cold start" to perform the function call.

Also please note (see http://hacksoflife.blogspot.com/2007/04/wrong-side-of-war.html): my x86 assembler sucks, so in some cases my explanation of what's going on may be sorely lacking.

Direct Function Calls:

Let's start with the simple case - a direct function call that hasn't been inlined.

C++:

an_func();

PPC:
0000001C: 48000001 bl .an_func__Fv
x86:
0x0000002c: E800000000 call ?an_func@@YAXXZ (0x31)


This is the simplest case - a direct function call to a known address - the destination is encoded in the program stream. In terms of my analysis I would say that this jump has a cost of "zero" memory accesses. Obviously the destination address comes from memory, but because it's in the program instruction stream, I expect that by the time we know we're going to branch, we hopefully know where to - no separate access to memory is needed.

Function Call Via Function Pointer

Before looking at this case, a brief tangent on the Mac: OS X for PowerPC, when using the CFM ABI (which happens to be the one I disassembled), uses an indirect function-pointer system called a "t-vector". Basically, a function call outside of a single shared library (DLL) goes through a "t-vector", which is a pointer to the function pointer. The t-vector structure also contains additional state that is swapped into registers as we make the call, effectively switching some runtime context per DLL. (This context is used to manage relocatable global variables - I'll comment on this more some other time.)

The reason I mention it is that a function call outside a DLL is more expensive than one inside a DLL for PPC CFM code. Why do I mention that? Because if we don't know what kind of function call we are going to make, the compiler has to use the expensive kind just in case we're jumping "long distance" into another DLL.

So starting with this case, we're going to start seeing a jump to "ptr_glue", a little assembly stub, part of the C runtime, that does the context switching.

Given the death of CFM and of the PPC on PCs, is any of this relevant? Only in this: if an ABI has both low and high cost function calls based on the amount of known information, calls through function pointers are going to involve the worst case, most conservative call mechanism (unless some kind of compiler intrinsic adds typing info to function pointers). So when we go to function-pointer calls, we may pick up some ABI overhead. In the case of CFM, the ABI overhead for a "cross-library" call is an extra memory indirection.
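
As a rough sketch of the shape described above (the field names are hypothetical - this is just the idea, not Apple's actual definition):

// A "t-vector": the function pointer you hold points at this, not at code.
struct t_vector {
    void (*code)();   // the actual function's address
    void*  context;   // per-DLL state (e.g. for relocatable globals)
};
// A cross-library call loads the code pointer (and swaps in the context)
// before branching - one extra memory indirection versus a direct call.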

In this one case I'll list the PPC and Mach-O code:

C++:

anfunc_f();
PPC (CFM):
00000030: 7FACEB78 mr r12,r29
00000034: 48000001 bl .__ptr_glue
00000038: 60000000 nop

PPC (Mach-O):
00000030: 7FCCF378 mr r12,r30
00000034: 7D8903A6 mtctr r12
00000038: 4E800421 bctrl

x86:
0x00000031: FF55F0 call dword ptr [ebp-0x10] near


A few notes on this:
  • The x86 code is actually the most intuitive here - a subroutine call with indirect addressing via a memory location somewhere on the stack (or wherever ebp is aimed right now).
  • Because the PowerPC has so many registers, for very small snippets (or very good compilers) local variables almost always happen to be in registers near the time they're needed. So while you can't see the memory load in the PowerPC code (r29 and r30 just happen to have our function pointer in them), it certainly didn't get loaded for free. The compiler has basically just made my local "anfunc_f" variable a "register" variable.
  • The PPC will, under some conditions which I've forgotten and am too lazy to look up, execute the instruction immediately after a branch. In some cases the compiler must insert a nop into this slot if it can't be used. I will omit the nops from now on, as they don't tell us much.
  • ptr_glue is our C runtime code that will make our "long distance" call. But a bl to ptr_glue with the destination address in r12 is approximately like a subroutine-call to the address stored in r12. (x86 hackers, the general purpose registers for PPC are named r0-r31 - they're all the same and all hold 32-bits on the 32-bit PPC chips).
The cost of a call to a function pointer is thus:
  • One extra memory access.
  • Any cost for a "long distance" function call based on the ABI.
That memory access might be notable because it must complete before the instruction stream can continue. I don't know how well modern CPUs can pipeline this, of course, but a cache miss here could be expensive.

Non-Virtual Method

Non-virtual methods are basically free functions with some compiler renaming and an implicit pointer to the object as their first argument. (That's how they know what "this" is.) So this will look almost the same as our free function call except that we're going to see the stack get set up a little bit. These functions are fully known so they are candidates for inlining.
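
A sketch of that equivalence (hypothetical names - real compilers do the renaming via name mangling):

class gadget {
public:
    void non_virt() { }
};

// What the compiler effectively generates: a renamed free function
// taking the object's address as a hidden first argument.
static void gadget_non_virt(gadget* this_ptr) { }

int main() {
    gadget g;
    g.non_virt();          // source form
    gadget_non_virt(&g);   // roughly the generated form
    return 0;
}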

(Another PPC hint: arguments on the PowerPC are passed by register starting with r3 when possible.)

C++:

ff->non_virt();
PPC:
00000020: 7FC3F378 mr r3,r30
00000024: 48000001 bl .non_virt__3fooFv
x86:
0x00000034: 8B4DF8 mov ecx,dword ptr [ebp-0x8]
0x00000037: E800000000 call ?non_virt@foo@@QAEXXZ (0x3c)



So - the cost of a non-virtual method is comparable to a direct call to a free function.

Virtual Method Call, Fully Known Object Type

Virtual functions are essentially function pointers that live in a side table. But if we make a virtual function call and the compiler is 100% sure at compile time of the type of the object, it isn't required to call through the vtable. (Bypassing the vtable has no effect on program correctness as long as we end up calling the right function.)

C++:

f.virt();
PPC:
00000028: 80620000 lwz r3,f(RTOC)
0000002C: 48000001 bl .virt__3fooFv
x86:
0x0000003c: B900000000 mov ecx,offset ?f@@3Vfoo@@A
0x00000041: E800000000 call ?virt@foo@@UAEXXZ (0x46)


So the cost of a virtual method may be as low as that of a non-virtual method. But I wouldn't count on this. For example, CodeWarrior will generate this optimization for a virtual method defined in a base class and called on the base class, but for a reason I am not sure of (probably I've missed some subtlety of C++) it will not do so for a virtual method defined in a base class and called on a derived class that does not override it.

Virtual Method Call Via Object Pointer

This is a true virtual method call: since we have only a pointer to the object, the compiler doesn't know the real runtime type, and we actually have to use those vtables. Now we start to pay a little more for virtual functions. Walking through the PPC listing below:
  • Transferring r30 to r3 - this is copying the "this" pointer.
  • Copying whatever is pointed to by r3 to r12. This is dereferencing the first word of the object to get the address of its vtable. One memory access.
  • Copying whatever is pointed to by r12+8 to r12. This is going to the vtable and pulling up the 3rd ptr. (Before our virtual function is the type-id and offset info.) Two memory accesses.
  • We see pointer-glue again here. This is just the Mac's idea of a far function call.
So - the cost in total of using a virtual function via a pointer to the object is:
  • Two extra memory reads.
  • Any cost of far function calls.
So a virtual method is more expensive than a function pointer by one memory indirection. If you wanted to optimize this out, you'd put function pointers into the object itself, and pay for the decreased memory chaining with larger object size. The vtable is a layer of indirection.
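
Here's a sketch of that trade-off with hypothetical types (not how any compiler actually lays objects out):

static void draw_impl() { }

// Vtable style: a shared table, two loads before the call.
struct vtbl { void (*draw)(); };
static vtbl the_vtbl = { &draw_impl };
struct widget_v { vtbl* vt; };

// Pointer-in-object style: one load, but every object carries the pointer.
struct widget_p { void (*draw)(); };

void call_v(widget_v* w) { w->vt->draw(); }   // object -> vtable -> function
void call_p(widget_p* w) { w->draw(); }       // object -> function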

C++:

ff->virt();
PPC:
00000030: 7FC3F378 mr r3,r30
00000034: 81830000 lwz r12,0(r3)
00000038: 818C0008 lwz r12,8(r12)
0000003C: 48000001 bl .__ptr_glue

x86:
0x00000046: 8B55F8 mov edx,dword ptr [ebp-0x8]
0x00000049: 89D1 mov ecx,edx
0x0000004b: 8B31 mov esi,dword ptr [ecx]
0x0000004d: 89D1 mov ecx,edx
0x0000004f: FF16 call dword ptr [esi] near


To be perfectly honest, I'm not really sure why the x86 code looks like it does. Reading these you might think we've got more memory accesses in the x86 code, but that's due to calling conventions - the PPC code can handle the argument entirely in registers.

Either way we've got two memory fetches:
  • One to fetch the vtable ptr from the start of the object.
  • One to fetch the actual function ptr, eight bytes into the vtable (skipping the type-id and offset fields, each 32-bits wide).
So the cost of a virtual function call is two memory accesses and any long function call overhead. In other words, virtual function calls cost one more memory access than function pointers. (On the other hand, we save memory by not having the function pointer in every single object.)
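
In pseudo-struct form, the layout being walked looks roughly like this (a sketch of the CodeWarrior/CFM-era layout as described in this post - the names are made up and none of this is portable C++):

struct vtable_layout {
    int    type_id;     // RTTI info, 32 bits
    int    offset;      // this-adjustment info, 32 bits
    void (*virt)();     // first virtual function, 8 bytes in
};
struct object_layout {
    vtable_layout* vptr;   // first word of the object
    // ...data members follow...
};
// ff->virt() is then roughly ff->vptr->virt(): load the vtable pointer,
// load the function pointer at offset 8, and call through it.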

Virtual Method Call With Virtual Base Class

When we have a virtual base, and our pointer type is of the derived class, we pick up one more memory access. The compiler must convert our derived pointer into the virtual base type, which requires following a pointer - a memory access.

(We will need the "this" pointer to be adjusted to the base type when we make the method call. The effective argument type of the virtual function must be that of the most basic class that defines the virtual function, because this is the only type for the "this" pointer that client code everywhere could set up consistently.)

I mention the need for the "this" pointer because it occurred to me that if the vtable of a derived class contained all of its base-class virtual functions, perhaps we could avoid having to cast first and then start the virtual function. (My idea was: if we could do these memory lookups in parallel, instead of fetching from the last pointer we read, we wouldn't be serialized trying to get data out of memory.)

But alas my idea doesn't work. Remember that the layout of the vtable of a class with a virtual base will be controlled by the full class, and we have no idea what full class "bb" is. So if we wanted to start providing duplicate virtual method entries in bb's vtable, we would have to duplicate every virtual function in the virtual base for every single intermediate derived class. Suffice it to say, the vtables would get somewhat unwieldy. (Speculation: this blog article emphasizes the cost of memory indirections, but perhaps the real cost is cache misses. So keeping the code small and tight and having vtables be small is probably a win. Better 3 memory fetches to cache than 2 to RAM!)

C++:

bb->virt();
PPC:
00000044: 807F0000 lwz r3,0(r31)
00000048: 81830000 lwz r12,0(r3)
0000004C: 818C0008 lwz r12,8(r12)
00000050: 48000001 bl .__ptr_glue
x86:
0x00000016: 8B45F8 mov eax,dword ptr [ebp-0x8]
0x00000019: 8B5004 mov edx,dword ptr [eax+0x4]
0x0000001c: 89D1 mov ecx,edx
0x0000001e: 8B31 mov esi,dword ptr [ecx]
0x00000020: 89D1 mov ecx,edx
0x00000022: FF16 call dword ptr [esi] near


So the cost of a virtual method call through a base class is three memory indirections and a long distance function call - one more indirection than if the base is non-virtual.

Final Thoughts: Examining the ABI

There were three techniques I used to find this information:
  1. The C++ runtime for CodeWarrior is available in source, so I was able to examine the source code and its comments.
  2. I disassembled simple code and also dumped run-time memory for simple code.
  3. In CodeWarrior, at least, you can replace run-time functions with your own code (if you know the "secret" names of the helper functions); with the right compiler settings you get a link warning, not a link error. This allowed me to add a lot of tracing to the runtime pretty easily.
(If CW 8's debugger worked on OS X 10.4, which it does not for me, I could have stepped through the code and runtime and dumped memory from there. But the debugger is really only a convenience tool for the two true debugging methods, printf and /* */.)

Monday, April 09, 2007

The wrong side of the war

I picked the wrong side...I've done a lot of work on PPC hardware, I think in Big Endian, and I can read PPC assembly. With Apple's move to x86 hardware, how is that useful??

I thought this link provided a nice intro to stack frames and register usage in x86 assembly. This is very important - the x86 has so few registers that code that looks clean and simple on the PPC gets quite obtuse as things get juggled around. Between that, 453 addressing modes, and 102,249 instructions, x86 assembly can look like the mad scribblings of Russell Crowe in "A Beautiful Mind".

Any idea what this does? :-)

0x00000000: 55 push ebp
0x00000001: 89E5 mov ebp,esp
0x00000003: 56 push esi
0x00000004: 83EC04 sub esp,0x4
0x00000007: B8CCCCCCCC mov eax,0xcccccccc
0x0000000c: 890424 mov dword ptr [esp],eax
0x0000000f: C745F800000000 mov dword ptr [ebp-0x8],offset ?b@@3Vbar@@A
0x00000016: 8B45F8 mov eax,dword ptr [ebp-0x8]
0x00000019: 8B5004 mov edx,dword ptr [eax+0x4]
0x0000001c: 89D1 mov ecx,edx
0x0000001e: 8B31 mov esi,dword ptr [ecx]
0x00000020: 89D1 mov ecx,edx
0x00000022: FF16 call dword ptr [esi] near
0x00000024: 8B1504000000 mov edx,dword ptr ?b@@3Vbar@@A+0x4
0x0000002a: 89D1 mov ecx,edx
0x0000002c: 8B31 mov esi,dword ptr [ecx]
0x0000002e: 89D1 mov ecx,edx
0x00000030: FF16 call dword ptr [esi] near
0x00000032: B800000000 mov eax,0x0
0x00000037: 8D65FC lea esp,dword ptr [ebp-0x4]
0x0000003a: 8B3424 mov esi,dword ptr [esp]
0x0000003d: 8B6C2404 mov ebp,dword ptr [esp+0x4]
0x00000041: 83C408 add esp,0x8
0x00000044: C3 ret near

Friday, April 06, 2007

C++ Objects Part 7: Dynamic Casts and Ambiguous Bases

Continuing with our discussion of how C++ objects are implemented in CodeWarrior, let's look at dynamic casts when an ambiguous base class is present.

Because a dynamic cast always down-casts to the full class and then goes back up, a cast that would be trivial for a static cast becomes more complex for a dynamic cast, and can run into ambiguous bases that otherwise would not be an issue. But you really have to work to see this. Here is a class hierarchy that requires run-time resolution of a dynamic cast with an ambiguous base:

class R { public: virtual ~R() { } };  // R needs a virtual method for dynamic_cast to compile
class A : public R { };
class B1 : public A { };
class B2 : public A { };
class D : public B1, public B2 { };

So what do we have to do to get an ambiguous base class dynamic cast?

First declare an instance of D.

Then cast a pointer to D to B1, then to R. (We must cast to R via B1 because the up-cast from D to R is ambiguous - we have two equally valid copies of R within D.)

Now dynamic cast from our pointer to R to a pointer to A. What happens?

Not what you'd think! You might think that since A has one copy of R, we simply update the pointer and go home, but the dynamic cast must do runtime checks to make sure that we really are an R embedded in an A. So the dynamic cast operator casts R all the way down to its full class (D) and then tries to up-cast from D to A, which is ambiguous. Oops!
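
Spelled out in code (using the hierarchy above):

D d;
B1* b1 = &d;                    // trivial up-cast
R*  r  = b1;                    // fine: via B1 there is exactly one R
A*  a  = dynamic_cast<A*>(r);   // runtime-resolved: down to the full
                                // class (D), then back up toward A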

The type-info structure has enough information to manage this. Every type-info structure has an array of all base classes (both direct and indirect).

An ambiguous base class has a flag indicating that it is ambiguous (it will be listed multiple times), and an integer indicating how many of the following entries are all accessed via the ambiguous base class.

(This implies that we have to iterate over the base class array because it is not a true array - the items are variable size!)
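
A hypothetical sketch of what each entry carries, per the description above (remember the real entries are variable-size, so this is conceptual only):

struct base_class_entry {
    const char* name;          // which base class this entry describes
    long        offset;        // where that base lives in the full object
    bool        ambiguous;     // set if this base appears more than once
    int         group_count;   // how many following entries are reached
                               // through this ambiguous base
};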

There are a lot of conditions on using an ambiguous base class as we search the base class list during a dynamic cast. We must meet three conditions:

- We can only accept a base class if it is the same as the destination type we are casting to.
- We can only accept a base class if its offset in the struct matches the offset we used to get from our initial pointer to our full type.
- We can only accept a base class if our starting type is a parent class of that particular base class.

What that means is: we can only cast to an ambiguous base directly from one of its parents. If we do, we need to make sure we have the right copy of this ambiguous base, which is why we check not only the type name but also the offset to this base from the full object (which disambiguates it).

(How is this possible when this layout will vary from full class to full class? Well, remember that EVERY base that EVER goes into a class is part of the casting table for a type-ID, and you never have to use more than one type-id structure to perform a cast. So when we say "class D has two A's, one at 0 bytes and one at 16 bytes", that info is truly only in the typeID for class D. For some other class (also with A) we'd have a totally different set of casting info. In other words, the casting info is effectively contained WITHIN the typeID so that it can be unique to this FULL class.)

Note one case that won't cast for us:

class R1 { public: virtual ~R1() { } };
class R2 { public: virtual ~R2() { } };
class A : public R1, public R2 { };
class B1 : public A { };
class B2 : public A { };
class D : public B1, public B2 { };

Given an object of type D and a pointer obtained via R1 (through B1), we can't dynamic cast from R1 to R2. Why? When we iterate through the ambiguous bases (A) we haven't STARTED on an A, so we don't find the perfect match to the A from B1. Thus we fail with a bad cast!
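
In code (using the hierarchy above), the failure shows up as a null result, since dynamic_cast on pointers returns NULL when the cast is bad:

D d;
B1* b1 = &d;
R1* r1 = b1;                      // via B1, avoiding the ambiguous up-cast
R2* r2 = dynamic_cast<R2*>(r1);   // cross-cast: fails, r2 is NULL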

(Is this right? I think so. As far as I can tell from the C++ spec, dynamic cast must succeed when we explicitly down-cast and the full class has some ambiguity in it, but cross-casts do not have to succeed. So casting from R1 to A is legal, but casting from R1 to R2 is not. This isn't hugely surprising, and I suspect that if your design depends on this cast, you need to reconsider the design!)