dacaw - I made similar observations while working on the optimization of Prime95. The performance of such code depends on how the macro-ops produced by decoding the x86 code are grouped together, and on whether enough low-latency (reg, reg) variants of these instructions are available. Pipeline stalls can arise if instructions that can go to only one of the three FP pipes significantly outnumber the other instructions (e.g. 20 FADD pipe ops mixed with 10 FMUL pipe ops: they will be grouped into 20 groups scheduled to the FPU reservation station, 10 of them with empty FMUL slots). With MMX code this should be somewhat easier, because many of the MMX ops can go through either the FMUL or the FADD pipe.
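
To put numbers on the grouping example above, here is a toy model in Python. It is only a sketch under simplifying assumptions, not a description of the real scheduler: each group is assumed to have exactly one FADD slot and one FMUL slot, the third FP pipe is ignored, and dependencies and latencies are not modeled. The function name `group_ops` and its parameters are made up for illustration.

```python
import math


def group_ops(fadd_only, fmul_only, either=0):
    """Toy model of macro-op grouping (hypothetical helper, not real hardware).

    Assumes each group has one FADD slot and one FMUL slot.
    fadd_only / fmul_only: ops that can use only that pipe.
    either: ops that can use either pipe (e.g. many MMX ops).
    Returns (number_of_groups, empty_slots).
    """
    total = fadd_only + fmul_only + either
    # Each pipe-bound op needs its own group; flexible ops fill leftover slots.
    groups = max(fadd_only, fmul_only, math.ceil(total / 2))
    empty_slots = 2 * groups - total
    return groups, empty_slots


# 20 FADD pipe ops mixed with 10 FMUL pipe ops
print(group_ops(20, 10))

# The same 30 ops if all of them could go through either pipe
print(group_ops(0, 0, either=30))
```

In this model the 20/10 mix needs 20 groups and leaves 10 slots empty, while 30 ops that can use either pipe fit into 15 full groups. That is the effect I would expect to make MMX code easier to schedule.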