dacaw - SSE3 and execution units
The existing 3 execution unit design is excellent for MMX & 3DNow but is not wide enough for SSE & SSE2, which requires a minimum of 4 units.
Well, even Intel doesn't have that many units for SSE/SSE2 in the P4/PM. They execute full 128-bit operations serially as internal 64-bit operations - the same way the K8 does.
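To make the "serially as internal 64-bit operations" point concrete, here is a minimal C sketch of a packed 32-bit add on a 128-bit value, done as two 64-bit steps. It is only a software model for illustration: the names `vec128`, `add_2x32` and `add_4x32` are made up for this post, and nothing here claims to match the real micro-op or macro-op encoding of either vendor.

```c
#include <stdint.h>

/* Hypothetical model of a logical 128-bit register as two 64-bit halves. */
typedef struct {
    uint64_t lo;  /* bits 0..63   */
    uint64_t hi;  /* bits 64..127 */
} vec128;

/* One internal 64-bit operation: add two packed 32-bit lanes.
   Each lane wraps on its own, so no carry crosses the lane boundary. */
uint64_t add_2x32(uint64_t a, uint64_t b)
{
    uint32_t lane0 = (uint32_t)a + (uint32_t)b;
    uint32_t lane1 = (uint32_t)(a >> 32) + (uint32_t)(b >> 32);
    return ((uint64_t)lane1 << 32) | lane0;
}

/* One logical 128-bit packed add, carried out as two 64-bit
   operations one after the other. internal_ops counts them. */
vec128 add_4x32(vec128 a, vec128 b, unsigned *internal_ops)
{
    vec128 r;
    r.lo = add_2x32(a.lo, b.lo);  /* first 64-bit half  */
    *internal_ops += 1;
    r.hi = add_2x32(a.hi, b.hi);  /* second 64-bit half */
    *internal_ops += 1;
    return r;
}
```

Every call to `add_4x32` bumps the counter by two, which is the whole point: one instruction as the programmer sees it, two passes through a 64-bit wide unit.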
Changing this to full 128-bit-per-clock execution of each SSE2 instruction would require so many changes to the CPU that it is not worth doing in the current K8 design. Watch out for the K10. According to the optimization manual, we will very likely see more and/or wider execution units in future CPU designs.
But at least one performance-improving fix is possible at low cost for the K8: letting SSE/SSE2 data accesses from cache achieve the same bandwidth as MMX or 64-bit integer load operations. Currently the maximum is half of a logical 128-bit register per cycle.
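As a rough back-of-the-envelope sketch of what that would mean, the C helper below counts the cycles needed to stream a buffer from cache at a given load width. The 8 bytes per cycle is the "half of a 128-bit register" figure from above; the 16 bytes per cycle is only an assumed target for illustration, not a measured or documented number, and `load_cycles` is a made-up name. Latency, alignment and misses are ignored.

```c
#include <stdint.h>

/* Idealized cycle count to load `bytes` from cache when the core
   can move `bytes_per_cycle` per clock (rounded up). */
uint64_t load_cycles(uint64_t bytes, uint64_t bytes_per_cycle)
{
    return (bytes + bytes_per_cycle - 1) / bytes_per_cycle;
}
```

For a 4096-byte buffer, `load_cycles(4096, 8)` gives 512 cycles and `load_cycles(4096, 16)` gives 256, so under these assumptions a load-bound SSE loop could at best halve its load time.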