*locking* and *unlocking* of the thread-safe version of putchar() was the bottleneck.
Switching to the unlocked putchar made the benchmark run twice as fast.
Commenting out the putchar calls entirely made it faster by another factor of 2.
So:
50% of the time went to locking (removing it halved the run time), 25% went to input/output (removing that halved it again), and the remaining 25% was actually doing arithmetic, calculating primes.
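For anyone who wants to see what the locked versus unlocked difference looks like in code, here is a rough sketch. It is not the benchmark's source, which I don't have in front of me; the names is_prime, put_number_unlocked and print_primes are made up for illustration, and trial division just stands in for whatever the real arithmetic is. It assumes a POSIX system, where putc_unlocked() (the FILE* sibling of putchar_unlocked()) and flockfile()/funlockfile() are available. The idea is to take the stream lock once around the whole loop instead of once per character.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* exposes putc_unlocked, flockfile */
#endif
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the benchmark's arithmetic: trial division. */
static bool is_prime(unsigned n) {
    if (n < 2) return false;
    for (unsigned d = 2; d * d <= n; d++)
        if (n % d == 0) return false;
    return true;
}

/* Write n in decimal plus a newline without taking the stream lock.
   The caller must already hold the lock on `out` (see print_primes). */
static void put_number_unlocked(FILE *out, unsigned n) {
    char digits[16];
    int len = 0;
    do {                         /* collect digits, least significant first */
        digits[len++] = (char)('0' + n % 10);
        n /= 10;
    } while (n > 0);
    while (len > 0)              /* emit them in the right order */
        putc_unlocked(digits[--len], out);
    putc_unlocked('\n', out);
}

/* Print every prime below `limit`, one per line, and return how many
   were printed. The stream is locked once for the whole loop, so the
   per-character lock/unlock pair that plain putchar() pays is gone. */
static unsigned print_primes(FILE *out, unsigned limit) {
    unsigned count = 0;
    assert(out != NULL);
    flockfile(out);
    for (unsigned n = 2; n < limit; n++) {
        if (is_prime(n)) {
            put_number_unlocked(out, n);
            count++;
        }
    }
    funlockfile(out);
    return count;
}
```

Calling print_primes(stdout, 100) from main() would print the primes below 100. Swapping putc_unlocked() back to putc() and dropping the flockfile()/funlockfile() pair gives the locked variant to time against; whether you see the same factor of 2 will depend on the C library and the machine.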
Gosh, I wonder if The Prescott New Instructions MONITOR and MWAIT have anything to do with the selection of this benchmark, and the performance of Nocona?