UpNDown - But wouldn't a longer pipeline imply more need for temporary storage, rename registers, etc.? Thus, a longer pipeline would need more logic gates, but with smaller per-cycle work units, allow higher frequencies?
Not necessarily. Shorter pipeline designs, by their nature, will have much more logic switching per clock cycle because more levels of logic are allowed.
One of the most basic exercises performed during the early phases of chip design is the circuit feasibility study, which determines how many levels of logic can be implemented per pipestage for a given target frequency of operation. For example, let's say you were designing a 2 GHz CPU. This yields a 500ps clock period. Let's say the process you are using yields an average CMOS logic delay of 30ps (e.g. for a 3-input NAND gate). The typical sequential element (e.g. a flip-flop) has an output delay of 50ps and an input setup time of 30ps. Assuming there is no clock "skew" (divergences from one clock net to another), that leaves 500 - 50 - 30 = 420ps for logic, and 420 / 30 = 14, so your design should contain, on average, 14 logic levels per pipestage.
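If you want to play with the numbers, the feasibility math above can be sketched in a few lines of Python. This is only a back-of-the-envelope helper, not a real timing tool: the function and parameter names are my own, and it assumes every gate has the same average delay and that skew is a simple subtraction from the clock period.

```python
def logic_levels_per_stage(freq_ghz, gate_delay_ps, clk_to_q_ps, setup_ps, skew_ps=0.0):
    """Estimate how many gate levels fit in one pipestage.

    All names here are illustrative. Delays are in picoseconds.
    """
    period_ps = 1000.0 / freq_ghz  # 1 GHz corresponds to a 1000 ps period
    # Time left for combinational logic after the flop overheads and skew
    usable_ps = period_ps - clk_to_q_ps - setup_ps - skew_ps
    # Only whole gate levels count, so round down
    return int(usable_ps // gate_delay_ps)


# The example from the post: 2 GHz, 30 ps gates, 50 ps output delay, 30 ps setup
print(logic_levels_per_stage(2.0, 30, 50, 30))  # 14
```

Plugging in a higher target frequency, or a nonzero skew, shows how quickly the logic budget per pipestage shrinks, since the flop overhead stays fixed while the period gets shorter.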
Knowing this, you now have to design your microarchitecture to determine what can be accomplished (e.g. register file read, addition operation, etc.) in a given pipestage. From this, a design team can then proceed to determining how long the pipeline needs to be to accomplish everything from instruction fetch to operation writeback and retirement.
Switching capacitance per pipestage is just as decisive a factor as frequency in the power consumption equation, since dynamic power scales linearly with both. You are basically trading off two factors that are equally influential.
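To put rough numbers on that trade-off, here is a small sketch using the standard first-order CMOS dynamic power relation, P = activity * C * V^2 * f. The capacitance, voltage, and frequency values are made up purely for illustration, and it assumes supply voltage stays the same between the two designs and ignores leakage and the extra flops a longer pipeline needs.

```python
import math


def dynamic_power_w(switched_cap_f, vdd_v, freq_hz, activity=1.0):
    """First-order dynamic power: activity * C * V^2 * f (watts)."""
    return activity * switched_cap_f * vdd_v ** 2 * freq_hz


# Hypothetical designs at the same supply voltage (numbers are invented):
# short pipeline: more logic switching per cycle, lower clock
short_pipe = dynamic_power_w(switched_cap_f=2e-9, vdd_v=1.2, freq_hz=1e9)
# long pipeline: half the switched capacitance per cycle, twice the clock
long_pipe = dynamic_power_w(switched_cap_f=1e-9, vdd_v=1.2, freq_hz=2e9)

# Halving C while doubling f leaves dynamic power unchanged
print(math.isclose(short_pipe, long_pipe))
```

Under these assumptions the two designs land on the same dynamic power, which is the sense in which the two factors are equally influential.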