MMoy, Pentium M does 2 64b int calcs per cycle, with a 3 to 4 cycle latency. Merom will do 2 128b int or 4 64b calcs per cycle, also with a latency of a couple of cycles.
re: What this means is that your compiler needs to
schedule the operations to achieve the theoretical
throughput. In practice, this is hard to do because
you need to do loads and stores mixed in with your
arithmetic operations. I'm leaving out other operations
that typically need to be tossed in.
Exactly right. Looking only at int-heavy programs and ignoring all other changes for now, practice will show an evolutionary performance increase, e.g. somewhere between 0 and 25%, instead of the theoretical 100%. Some somewhat exotic cases will gain more, as will some synthetic benchmarks.
Other improvements (bandwidth, latency to main memory, cache, 4-issue, higher frequency at the cost of a longer pipeline, etc.) come on top of this.
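To put rough numbers on the 0-25% point, here is a back-of-the-envelope sketch using an Amdahl-style estimate. It assumes only the share of run time spent in the wide int ops gets the 2x throughput, while loads, stores and everything else run at the old speed. The function name and the fractions are made up for illustration, not measured on any real chip.

```python
def overall_speedup(simd_fraction, simd_gain=2.0):
    """Amdahl-style estimate (illustrative assumption, not a measurement).

    simd_fraction: share of run time spent in the wide int ops (0.0 to 1.0)
    simd_gain:     throughput gain for that share (2.0 = doubled width)
    """
    if not 0.0 <= simd_fraction <= 1.0:
        raise ValueError("simd_fraction must be between 0 and 1")
    # Time left over = untouched part + accelerated part.
    return 1.0 / ((1.0 - simd_fraction) + simd_fraction / simd_gain)


if __name__ == "__main__":
    # Hypothetical shares of run time spent in the wide int ops.
    for fraction in (0.0, 0.1, 0.2, 0.4, 1.0):
        gain_percent = (overall_speedup(fraction) - 1.0) * 100.0
        print(f"{fraction:.0%} of time in int ops: about {gain_percent:.1f}% faster")
```

Under these assumptions you only reach a 25% gain when some 40% of the run time is in the widened ops, and the full 100% only when the code does nothing else, which is roughly the synthetic benchmark case.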
Regards,
Rink