Dear J3pflynn:
The article has some glaring errors in the decode section. The 4-1-1 P-III/PM decoder does not issue 6 MOPs per cycle, only 3. Only one decode path can handle complex instructions, and it still issues just one MOP per cycle. K7/K8 can decode 3 complex instructions per cycle, generating 6 MOPs per cycle (3 execute and 3 load/store). In K8, an execute MOP with no paired load/store MOP can be combined with a load/store MOP that has no paired execute MOP into a single MOP pair.
So, using their terminology, K7/K8 has a 4-4-4 decoder generating 2-2-2 MOPs per clock. P6 is supposed to have a 4-4-4-4 decoder, but one generating only 1-1-1-1 MOPs per clock.
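To make the arithmetic concrete, here is a rough sketch that just sums the per-slot MOP counts quoted above. The table and the function name peak_mops_per_cycle are only for illustration, and the numbers are the ones claimed in this post, not figures taken from vendor documentation.

```python
# MOPs each decode slot can emit per cycle, as claimed in this post
# (illustrative figures, not vendor-verified).
DECODERS = {
    "P-III/PM": [1, 1, 1],   # 3 slots, one MOP each
    "K7/K8": [2, 2, 2],      # 3 slots, one execute + one load/store MOP each
    "P6": [1, 1, 1, 1],      # 4 slots, one MOP each
}


def peak_mops_per_cycle(slots):
    """Peak decoder output is the sum of what every slot can emit."""
    return sum(slots)


if __name__ == "__main__":
    for name, slots in DECODERS.items():
        pattern = "-".join(str(n) for n in slots)
        print(f"{name}: {pattern} -> {peak_mops_per_cycle(slots)} MOPs/clock")
```

On these numbers the three-slot K7/K8 decoder still comes out ahead of the four-slot one, 6 MOPs per clock against 4.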
As to the performance estimates, a 2.8GHz K8 already beats the SPECfp2000 score. A 3GHz K8 would likely still outrun Conroe in SPECfp2000. A 3GHz K8 would also likely beat Conroe's SPECint2000 score using the same compiler. Of course, K10 may be out by then with an additional FPadd and FPmul unit, letting it execute SSE2 packed instructions at 1 FPadd_pair and 1 FPmul_pair per cycle, or 4 DP flops per cycle. This will likely push K10 far out of Conroe's reach in FP, and it may even exceed Power and Itanium SPECfp2000 scores.
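The 4 DP flops per cycle figure follows from a packed SSE2 operation working on two doubles at once. A quick sketch of that count, assuming one packed add and one packed multiply retire every cycle (the function name is made up for illustration, and this is peak throughput, not a SPEC prediction):

```python
# A 128-bit SSE2 register holds two 64-bit doubles, so one packed
# operation performs two double-precision flops.
DP_LANES_PER_PACKED_OP = 2


def dp_flops_per_cycle(packed_adds, packed_muls, lanes=DP_LANES_PER_PACKED_OP):
    """Peak DP flops per cycle given packed add/mul ops issued per cycle."""
    return (packed_adds + packed_muls) * lanes


if __name__ == "__main__":
    # One FPadd_pair plus one FPmul_pair per cycle, as described above.
    print(dp_flops_per_cycle(packed_adds=1, packed_muls=1))  # prints 4
```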
Pete