"The quote that you linked seems to suggest a lot of rework for nothing, and I don't buy it."
They increased the number of pipeline stages by two, or 20%. That would mean a general re-balancing
of all the resources including the execution units. If not, they become a major chokepoint and a likely
target for a rework of the core. I don't think you would find any competent design team that would do
something like that...
They changed the front end to improve IPC and also had to increase the complexity of the x86 decoding
logic for x86-64. These changes are in the part of the pipeline where the two extra stages were added.
Add to that the fact that the execution back end of an x86 processor is unlikely to be the timing-critical
part of the device, and there is no compelling reason to assume that there was a lot of timing slack
that needed to be re-apportioned to the pipeline past the issue stage to balance the design.
The fact that the Opteron's clock lags behind the Athlon's despite SOI processing seems to confirm
that most of the logic evaluation time of the two new stages was used up rather than redistributed
to improve frequency scalability.
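
To put rough numbers on the argument, here is a toy sketch. Every delay figure in it is made up for illustration (none come from AMD or from the linked quote), and the helper names `spread` and `max_clock_ghz` are my own. It only assumes that the clock is limited by the slowest pipeline stage and that the front end, not the back end, is the critical part of a 10-stage baseline.

```python
# Toy model: clock frequency is set by the slowest pipeline stage.
# All delays are HYPOTHETICAL picosecond values chosen for illustration.

def spread(total_ps, n_stages):
    """Split a block of logic evenly across n_stages pipeline stages."""
    return [total_ps / n_stages] * n_stages

def max_clock_ghz(stage_delays_ps):
    """Highest clock the pipeline supports: 1 / (slowest stage delay)."""
    return 1000.0 / max(stage_delays_ps)

# Assumed back end: 4 stages at 400 ps each (not timing critical).
back_end = [400.0] * 4

# Assumed 10-stage baseline: 3000 ps of front-end logic over 6 stages.
baseline = spread(3000.0, 6) + back_end

# Case A: two stages added, but new front-end logic (wider decode, IPC
# improvements) grows the front-end work to 4000 ps, filling the new stages.
used_up = spread(4000.0, 8) + back_end

# Case B: two stages added with no new logic; the same 3000 ps is simply
# redistributed over 8 stages to chase clock speed.
redistributed = spread(3000.0, 8) + back_end

print(len(baseline), max_clock_ghz(baseline))            # 10 stages
print(len(used_up), max_clock_ghz(used_up))              # 12 stages
print(len(redistributed), max_clock_ghz(redistributed))  # 12 stages
```

In case A the slowest stage is unchanged, so the clock ceiling does not move and the back end still has slack. Only in case B does the back end become the limiter, and that is the only scenario in which the execution units would need re-balancing.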