I like the trace cache because:
1) Why decode instructions more than once? It wastes power.
2) It might reduce dependence on L1 I-cache latency, though the fact that it has to be bigger would be counterproductive if that made it take one more clock cycle to access.
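To make point 1 concrete, here is a toy sketch of the "decode once, reuse many times" idea. It is only an illustration: the class name, the fake decode step, and the three-instruction program are all made up, and a real trace cache stores traces along the predicted path rather than one entry per address.

```python
# Toy model: cache decoded micro-ops by instruction address so the decoder
# runs once per instruction instead of once per execution.
# All names and the fake "decode" step are hypothetical.

class ToyTraceCache:
    def __init__(self):
        self.uops = {}          # address -> list of decoded micro-ops
        self.decode_count = 0   # how many times the (expensive) decoder ran

    def _decode(self, instruction):
        self.decode_count += 1
        # Stand-in for real decoding: split "op a,b" into pseudo micro-ops.
        op, _, operands = instruction.partition(" ")
        return [f"{op}.{i}:{arg.strip()}"
                for i, arg in enumerate(operands.split(","))]

    def fetch(self, address, instruction):
        # Decode only on the first visit; later visits reuse the stored micro-ops.
        if address not in self.uops:
            self.uops[address] = self._decode(instruction)
        return self.uops[address]


program = {0x100: "add r1,r2", 0x104: "load r3,[r1]", 0x108: "jnz r3,0x100"}
tc = ToyTraceCache()
for _ in range(1000):           # a loop body executed 1000 times
    for addr, insn in program.items():
        tc.fetch(addr, insn)

print(tc.decode_count)          # decoder invocations across 3000 fetches
```

In this toy the decoder runs once per distinct address no matter how many times the loop executes, which is the power argument in a nutshell. It says nothing about access latency, which is the open question in point 2.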
instructions can be selected differently for bundle packing each time through
I thought a trace cache just held the µops for each instruction, not how they would be bundled. Maybe a trace cache (or the L1 D-cache, for that matter) could be made smart enough to remember things like, "this instruction or µop often causes a cache miss."
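A rough sketch of what that "remember it often misses" hint could look like, assuming a small saturating counter kept alongside each cached entry. The 2-bit width, the threshold, and all the names are my assumptions for illustration, not how any real design does it.

```python
# Hypothetical sketch: a 2-bit saturating counter stored with each cached
# micro-op, tracking whether it tends to miss in the data cache.

class MissHint:
    MAX = 3          # a 2-bit counter saturates at 3
    THRESHOLD = 2    # counter >= 2 is read as "often misses"

    def __init__(self):
        self.counter = 0

    def update(self, missed):
        # Count up on a miss, down on a hit, clamped to [0, MAX].
        if missed:
            self.counter = min(self.counter + 1, self.MAX)
        else:
            self.counter = max(self.counter - 1, 0)

    def often_misses(self):
        return self.counter >= self.THRESHOLD


hints = {}  # micro-op address -> MissHint

def record_access(address, missed):
    hints.setdefault(address, MissHint()).update(missed)

# A load at 0x104 that misses three times, then hits once.
for missed in (True, True, True, False):
    record_access(0x104, missed)
```

The saturating counter means one lucky hit does not erase the history, so the hint only flips after the behavior changes for a couple of accesses in a row.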
Petz