"dougSF30, what you've said is exactly what wbmw said. 1. adding pipeline stages to boost the clock frequency. 2. adding pipeline stages to remove the bottleneck in decode logic.
From my point of view 2 implies 1."
Not at all. The bottleneck could be in the work done per clock, not in a limit on the maximum clock frequency. For example, the aim could have been to move more instructions from vector path to fast path decoding, increase the level of parallelism, or shorten penalties on branch mispredictions. See the Opteron optimisation guide for details: