In truth I will only look at the cost of method calls - I will operate under the assumption that using a C++ object is like using a struct from a data-access standpoint, which is mostly true; you can see from past posts that virtual base classes represent containment by pointer while non-virtual base classes are contained by, well, containment, so you can imagine the costs involved.
Before getting into this, one overarching note: in my experience working on X-Plane, effective function inlining is much more important to the rapid execution of performance-critical C++ code than any of the overheads here. In particular, all of these examples assume that inlining has not occurred. The loss of inlining is a much larger cost than the additional pointer references, glue calls, etc. that we'll see.
The compiler can only inline a function when it knows exactly what will be called. Any time we have polymorphism, that goes away, because runtime data determines the call type. These examples are done with inlining and optimization off (mostly to keep the assembly code sane), but basically no call that requires a memory access before the function call to fetch an address can be inlined.
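To make that concrete, here's a minimal sketch (the names are invented for this example, not taken from the test program below): the first call's target is known at compile time and can be flattened; the second goes through a pointer whose value is only known at runtime, so it stays a real call.

inline int add_one(int x) { return x + 1; }

int direct(int v)
{
    // Target known at compile time - the compiler can inline this,
    // reducing the whole function to a single add.
    return add_one(v);
}

int indirect(int v, int (*fn)(int))
{
    // Target is whatever 'fn' holds at runtime - the compiler must
    // load the pointer and emit an indirect call.
    return fn(v);
}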
Test Cases
I wrote a stub of code to look at free functions and function pointers (to look at the cost of call-via-memory), virtual functions, and virtual base classes. In all cases I wrote a stupid stub program and let CodeWarrior disassemble the results.
Here's the C++:
static void an_func() { }
class foo {
public:
void non_virt() { }
virtual void virt() { }
};
class bar : public virtual foo { };
I have eliminated any arguments and return types, which avoids a lot of stack setup code that would otherwise make the examples harder to read in assembly. (We can't all be Larry Osterman. I agree with him 100%, but I suspect many programmers don't speak x86 as their first language.)
Here's the test environment:
foo f;
bar b;
void (* anfunc_f)() = an_func;
foo * ff = &f;
bar * bb = &b;
The optimizer is off - in real production code the optimizer may be able to share common setup costs for function calls. This exercise should demonstrate the total number of memory accesses required from a "cold start" to perform the function call.
Also please note (http://hacksoflife.blogspot.com/2007/04/wrong-side-of-war.html): my x86 assembler sucks, so in some cases my explanation of what's going on may be sorely lacking.
Direct Function Calls
Let's start with the simple case - a direct function call that hasn't been inlined.
C++:
an_func();
PPC:
0000001C: 48000001 bl .an_func__Fv
x86:
0x0000002c: E800000000 call ?an_func@@YAXXZ (0x31)
This is the simplest case - a direct function call to a known address - the destination is encoded in the program stream. In terms of my analysis I would say that this jump has a cost of "zero" memory accesses. Obviously the destination address comes from memory, but because it's in the program instruction stream, I expect that by the time we know we're going to branch, we hopefully know where to - no separate access to memory is needed.
Function Call Via Function Pointer
Before looking at this case, a brief tangent on the Mac: OS X for PowerPC, when using the CFM ABI (which happens to be the one I decompiled), uses an indirect function-pointer scheme called a "t-vector". Basically, a function call outside of a single shared library (DLL) goes through a t-vector, which is a pointer to the function pointer. The t-vector structure also contains additional state that is swapped into registers as we make the call, effectively switching some runtime context per DLL. (This context is used to manage relocatable global variables - I'll comment on this more some other time.)
The reason I mention it is that a function call outside a DLL is more expensive than one inside a DLL for PPC CFM code. Why do I mention that? Because if we don't know what kind of function call we are going to make, the compiler has to use the expensive kind just in case we're jumping "long distance" into another DLL.
So starting with this case, we're going to start seeing a jump to "ptr_glue", a little assembly stub (part of the C runtime) that does the context switching.
Given the death of CFM and of the PPC on PCs, is any of this relevant? Only in this: if an ABI has both low and high cost function calls based on the amount of known information, calls through function pointers are going to involve the worst case, most conservative call mechanism (unless some kind of compiler intrinsic adds typing info to function pointers). So when we go to function-pointer calls, we may pick up some ABI overhead. In the case of CFM, the ABI overhead for a "cross-library" call is an extra memory indirection.
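To give a rough picture of where that indirection comes from, here's my own sketch of the idea (the field names and layout are my guesses for illustration, not the real CFM headers):

// Sketch only - not the actual CFM definition. The point is that the
// "function pointer" you hold is really the address of this little
// structure, so the call has to read memory before it can branch.
struct t_vector_sketch {
    void * code;     // address of the actual code to run
    void * context;  // per-DLL state that ptr_glue swaps into registers
                     // (used for relocatable globals)
};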
In this one case I'll list the PPC and Mach-O code:
C++:
anfunc_f();
PPC (CFM):
00000030: 7FACEB78 mr r12,r29
00000034: 48000001 bl .__ptr_glue
00000038: 60000000 nop
PPC (Mach-O):
00000030: 7FCCF378 mr r12,r30
00000034: 7D8903A6 mtctr r12
00000038: 4E800421 bctrl
x86:
0x00000031: FF55F0 call dword ptr [ebp-0x10] near
A few notes on this:
- The x86 code is actually the most intuitive here - a subroutine call with indirect addressing via a memory location somewhere on the stack (or wherever ebp is aimed right now).
- Because the PowerPC has so many registers, for very small snippets (or a very good compiler), local variables almost always happen to be in a register near the time they're needed. So while you can't see the memory load in the PowerPC code (r29 and r30 just happen to have our function pointer in them), they certainly didn't get loaded for free. The compiler has basically just made my local "anfunc_f" variable a "register" variable.
- The PPC will, under some conditions which I've forgotten and am too lazy to look up, make use of the instruction slot immediately after a branch; in some cases the compiler must insert a nop into this slot if it can't be used. I will omit the nops from now on, as they don't tell us much.
- ptr_glue is our C runtime code that will make our "long distance" call. But a bl to ptr_glue with the destination address in r12 is approximately like a subroutine-call to the address stored in r12. (x86 hackers, the general purpose registers for PPC are named r0-r31 - they're all the same and all hold 32-bits on the 32-bit PPC chips).
- One extra memory access.
- Any cost for a "long distance" function call based on the ABI.
Non-Virtual Method
Non-virtual methods are basically free functions with some compiler renaming and an implicit pointer to the object as their first argument. (That's how they know what "this" is.) So this will look almost the same as our free function call except that we're going to see the stack get set up a little bit. These functions are fully known so they are candidates for inlining.
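Put another way, here is a conceptual rewriting (foo_non_virt is a made-up name - the real compiler does the equivalent renaming internally):

class foo { public: void non_virt() { } };   // trimmed-down version of the test class

// Roughly what the compiler emits for foo::non_virt(): a free function
// whose hidden first argument is the object ("this").
static void foo_non_virt(foo * this_ptr) { }

void example(foo * ff)
{
    ff->non_virt();      // what we write
    foo_non_virt(ff);    // what it effectively compiles to - a direct
                         // call to a known address, just like an_func()
}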
(Another PPC hint: arguments on the PowerPC are passed by register starting with r3 when possible.)
C++:
ff->non_virt();
PPC:
00000020: 7FC3F378 mr r3,r30
00000024: 48000001 bl .non_virt__3fooFv
x86:
0x00000034: 8B4DF8 mov ecx,dword ptr [ebp-0x8]
0x00000037: E800000000 call ?non_virt@foo@@QAEXXZ (0x3c)
So - the cost of a non-virtual method is comparable to a direct call to a free function.
Virtual Method Call, Fully Known Object Type
Virtual functions are essentially function pointers that live in a side-table. But if we make a virtual function call and the compiler is 100% sure at compile time of the type of the object, it isn't required to call through the vtable. (Calling via the vtable doesn't have any effect on program correctness as long as we end up calling the right function.)
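You can see the distinction from the source side with a small sketch (the qualified-call form at the end is standard C++; whether a compiler actually devirtualizes the other forms is up to the compiler):

class foo { public: virtual void virt() { } };

void example(foo * ff)
{
    foo f;
    f.virt();         // the object is known to be exactly a foo, so the
                      // compiler may call foo::virt() directly
    ff->virt();       // runtime type unknown - must go through the vtable
    ff->foo::virt();  // qualified call: the language itself suppresses
                      // virtual dispatch, so this is always a direct call
}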
C++:
f.virt();
PPC:
00000028: 80620000 lwz r3,f(RTOC)
0000002C: 48000001 bl .virt__3fooFv
x86:
0x0000003c: B900000000 mov ecx,offset ?f@@3Vfoo@@A
0x00000041: E800000000 call ?virt@foo@@UAEXXZ (0x46)
So the cost of a virtual method may be as low as a non-virtual method. But I wouldn't count on this. For example, CodeWarrior will generate this optimization for a virtual method defined in a base class and called on the base class, but for a reason that I am not sure of (probably I've missed some subtlety of C++) it won't do so for a virtual method defined in a base class and called on a derived class that does not override it.
Virtual Method Call Via Object Pointer
This is a true virtual method call - since all we have is a pointer to the object, the compiler doesn't know the runtime type, and we actually have to use those vtables. Now we start to pay a little more for virtual functions. Walking through the PPC code below:
- Transferring r30 to r3 - this is copying the "this" pointer.
- Copying whatever is pointed to by r3 to r12. This is dereferencing the first word of the object to get the address of its vtable. One memory access.
- Copying whatever is pointed to by r12+8 to r12. This is going to the vtable and pulling up the 3rd ptr. (Before our virtual function is the type-id and offset info.) Two memory accesses.
- We see pointer-glue again here. This is just the Mac's idea of a far function call.
- Two extra memory reads.
- Any cost of far function calls.
C++:
ff->virt();
PPC:
00000030: 7FC3F378 mr r3,r30
00000034: 81830000 lwz r12,0(r3)
00000038: 818C0008 lwz r12,8(r12)
0000003C: 48000001 bl .__ptr_glue
x86:
0x00000046: 8B55F8 mov edx,dword ptr [ebp-0x8]
0x00000049: 89D1 mov ecx,edx
0x0000004b: 8B31 mov esi,dword ptr [ecx]
0x0000004d: 89D1 mov ecx,edx
0x0000004f: FF16 call dword ptr [esi] near
To be perfectly honest, I'm not really sure why the x86 code looks like it does. Reading these you might think we've got more memory accesses in the x86 code, but that's due to calling conventions - the PPC code can handle the argument entirely in registers.
Either way we've got two memory fetches:
- One to fetch the vtable pointer from the start of the object.
- One to fetch the actual function pointer, eight bytes into the vtable (skipping the type-id and offset fields, each 32 bits wide).
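For illustration, here is roughly what those two fetches look like if you spell them out by hand. This depends on the particular 32-bit layout this compiler used (vtable pointer in the object's first word, function slot 8 bytes into the vtable) and is not legal, portable C++:

typedef void (*virt_slot)(foo *);    // treat the slot as a plain function
                                     // taking "this" - illustration only

void call_virt_by_hand(foo * ff)
{
    void ** vtable = *(void ***) ff;            // fetch 1: the vtable pointer
    virt_slot fn   = (virt_slot) vtable[2];     // fetch 2: skip the type-id
                                                // and offset words, read +8
    fn(ff);                                     // the actual indirect call
}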
Virtual Method Call With Virtual Base Class
When we have a virtual base, and our pointer type is of the derived class, we pick up one more memory access. The compiler must convert our derived pointer into the virtual base type, which requires following a pointer - a memory access.
(We will need the "this" pointer to be adjusted to the base type when we make the method call. The effective argument type of the virtual function must be of the most basic class that defines the virtual function, because this is the only type the "this" pointer could have that client code could set up consistently everywhere.)
I mention the need for the "this" pointer because it occurred to me that if the vtable of a derived class contained all of its base class virtual functions, perhaps we could avoid having to convert the pointer first and only then start the virtual function call. (My idea was: if we could do these memory lookups in parallel, instead of each fetch depending on the last pointer we read, we wouldn't be serialized trying to get data out of memory.)
But alas my idea doesn't work. Remember that the layout of the vtable of a class with a virtual base will be controlled by the full class, and we have no idea what full class "bb" is. So if we wanted to start providing duplicate virtual method entries in bb's vtable, we would have to duplicate every virtual function in the virtual base for every single intermediate derived class. Suffice it to say, the vtables would get somewhat unwieldy. (Speculation: this blog article emphasizes the cost of memory indirections, but perhaps the real cost is cache misses. So keeping the code small and tight and having vtables be small is probably a win. Better 3 memory fetches to cache than 2 to RAM!)
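One way to picture the layout (this sketch is an assumption for illustration only, though the x86 code below - the load from [eax+0x4] - is consistent with a pointer to the virtual base living at offset 4 of the bar object):

class foo;   // the virtual base, as in the test program

// Sketch of one possible 32-bit layout for bar - not necessarily what any
// given ABI really does.
struct bar_sketch {
    void * housekeeping;    // offset 0: bar's own vtable pointer or similar
    foo  * virtual_base;    // offset 4: pointer to the shared foo subobject
    // ... the foo subobject itself lives somewhere later in the full object
};
// bb->virt() must load virtual_base (fetch 1), then that subobject's vtable
// pointer (fetch 2), then the function slot itself (fetch 3).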
C++:
bb->virt();
PPC:
00000044: 807F0000 lwz r3,0(r31)
00000048: 81830000 lwz r12,0(r3)
0000004C: 818C0008 lwz r12,8(r12)
00000050: 48000001 bl .__ptr_glue
x86:
0x00000016: 8B45F8 mov eax,dword ptr [ebp-0x8]
0x00000019: 8B5004 mov edx,dword ptr [eax+0x4]
0x0000001c: 89D1 mov ecx,edx
0x0000001e: 8B31 mov esi,dword ptr [ecx]
0x00000020: 89D1 mov ecx,edx
0x00000022: FF16 call dword ptr [esi] near
So the cost of a virtual method call through a base class is three memory indirections and a long distance function call - one more indirection than if the base is non-virtual.
Final Thoughts: Examining the ABI
There were three techniques I used to find this information:
- The C++ runtime for CodeWarrior is available in source, so I was able to examine the source code and its comments.
- I disassembled simple code and also dumped run-time memory for simple code.
- In CodeWarrior, at least, you can replace run-time functions with your own code if you know the "secret" names of the helper functions (with the right compiler settings you get a link warning, not a link error). This allowed me to add a lot of tracing to the runtime pretty easily.