qsort is also a very small program, relatively speaking, so the deleterious effect of AMD64 code size expansion won't be readily apparent with a 64 KB icache.
Good point. Yes, this application fits entirely within the L1 cache. My apps are generally in the 1-4 MB range, and they all run better. Maybe I'm just lucky. Your theory sounds reasonable, but I have not seen it hold in practice.
Your Intel bias is so blatantly obvious you should both be ashamed of yourselves. "Small tight loops" as in Quicksort do NOT NEED lots of registers, so the extra registers will do no good.
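To make the "small tight loop" point concrete, here is a minimal Quicksort sketch in C. It is not the benchmark code under discussion; the function names and the Lomuto partition scheme are my own assumptions, chosen for illustration. The inner loop keeps only about six values live (array pointer, pivot, two indices, the upper bound and a swap temporary). Whether a particular compiler keeps all of them in registers on 32-bit x86 is something you would have to check in the generated assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Lomuto partition (illustrative choice, not taken from the benchmark).
   Live values in the loop: a, pivot, i, j, hi, and a swap temporary. */
static size_t partition(int *a, size_t lo, size_t hi)
{
    int pivot = a[hi];              /* last element is the pivot */
    size_t i = lo;                  /* next slot for a "small" element */
    for (size_t j = lo; j < hi; j++) {
        if (a[j] < pivot) {
            int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
            i++;
        }
    }
    int tmp = a[i]; a[i] = a[hi]; a[hi] = tmp;  /* move pivot into place */
    return i;
}

/* Sorts a[lo..hi], both bounds inclusive. */
static void quicksort(int *a, size_t lo, size_t hi)
{
    assert(a != NULL);
    if (lo >= hi)
        return;
    size_t p = partition(a, lo, hi);
    if (p > lo)                     /* avoid size_t underflow at p == 0 */
        quicksort(a, lo, p - 1);
    quicksort(a, p + 1, hi);
}
```

Call it as quicksort(array, 0, n - 1) for an array of n > 0 ints. Compiling with gcc -S for both 32-bit and 64-bit targets and comparing the inner loops would show whether the extra registers change anything here.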
The story may be very different when you are dealing with a DBMS program
Like DB2, for instance? I guess that is why 64-bit DB2 on IBM eServer 325 clusters running 64-bit Linux has the highest TPC-H test results for both 100 GB and 300 GB data sizes. And those results are almost 9 months old and haven't been exceeded. 8-CPU Xeon results are less than half as high as 8x2 CPU Opteron.