One last SSE note: I tried applying SSE optimizations to mesh indexed matrix transforms. Some very simple SSE transforms improved throughput by 15%, but that gain went away when I moved to a more complex SSE implementation that tried to avoid the cost of unaligned loads.
Surprising? Well, when I Sharked the more complete implementation, it was clear that it was bound by memory bandwidth. Using the CPU more efficiently doesn't help much if the CPU is starved for data.
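
To make the "very simple SSE transform" concrete, here is a minimal sketch of what such a loop might look like. It is not the actual code: the function name `transform_indexed`, the column-major 4x4 matrix layout, the one-byte per-vertex matrix index, and the packed xyz positions are all assumptions made for illustration. It uses unaligned loads throughout and falls back to scalar code when SSE is not available.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Hypothetical layout (an assumption, not taken from the original code):
 *   mats : array of column-major 4x4 matrices, 16 floats each
 *   idx  : one matrix index per vertex
 *   in   : packed xyz positions, 3 floats per vertex
 *   out  : packed xyz results,   3 floats per vertex
 * Computes out = M[idx] * (x, y, z, 1) for each vertex. */
void transform_indexed(const float *mats, const unsigned char *idx,
                       const float *in, float *out, size_t count)
{
    assert(mats != NULL && idx != NULL && in != NULL && out != NULL);

    for (size_t i = 0; i < count; ++i) {
        const float *m = mats + 16 * (size_t)idx[i];
        const float *p = in + 3 * i;
        float *q = out + 3 * i;
#if HAVE_SSE
        /* Unaligned loads: nothing here requires 16-byte alignment. */
        __m128 c0 = _mm_loadu_ps(m + 0);
        __m128 c1 = _mm_loadu_ps(m + 4);
        __m128 c2 = _mm_loadu_ps(m + 8);
        __m128 c3 = _mm_loadu_ps(m + 12);

        /* r = c0*x + c1*y + c2*z + c3 (w is implicitly 1). */
        __m128 r = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(c0, _mm_set1_ps(p[0])),
                       _mm_mul_ps(c1, _mm_set1_ps(p[1]))),
            _mm_add_ps(_mm_mul_ps(c2, _mm_set1_ps(p[2])), c3));

        /* Store all four lanes to a temporary, then keep only xyz. */
        float tmp[4];
        _mm_storeu_ps(tmp, r);
        q[0] = tmp[0];
        q[1] = tmp[1];
        q[2] = tmp[2];
#else
        /* Scalar fallback with the same math. */
        for (int k = 0; k < 3; ++k)
            q[k] = m[k] * p[0] + m[4 + k] * p[1] + m[8 + k] * p[2] + m[12 + k];
#endif
    }
}
```

In this sketch each vertex costs only a handful of multiplies and adds, but it reads a position, an index, and a matrix, and writes a position back. With so little arithmetic per byte moved, a loop shaped like this can plausibly run into the memory system before it runs out of ALU, which is consistent with what the profile showed.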