News Focus

mmoy

05/10/06 1:42 PM

#4833 RE: wbmw #4828

> Similarly for K8 wrt K7, the IMC was the right solution for
> many reasons (especially the scalability it gives to memory
> bandwidth in a multi-socket system), but for the broad range of
> workloads on the single socket client, I don't think you
> should expect more than 5-10% performance, with the rest
> coming from the many optimizations made to the K8 core.

I will most likely find out, as I plan to buy a Conroe system
when they come out and can then do some testing on workloads
that are considerably bigger than the caches.
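For reference, one common way to build such a cache-busting test (my own sketch, not from any published benchmark) is a pointer chase over a randomly linked array: each access depends on the previous one, so hardware prefetchers can't hide the latency. In Python the interpreter overhead swamps the actual memory latency, so treat this as an illustration of the method rather than a real measurement:

```python
import random
import time

def build_chain(n, seed=1):
    """Link the indices 0..n-1 into one random cycle, so chain[i] gives
    the next index to visit and every load depends on the previous one."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    chain = [0] * n
    for a, b in zip(order, order[1:]):
        chain[a] = b
    chain[order[-1]] = order[0]   # close the cycle
    return chain

def chase(chain, steps):
    """Follow the chain for `steps` dependent accesses; return where we end."""
    p = 0
    for _ in range(steps):
        p = chain[p]
    return p

if __name__ == "__main__":
    # Sizes are illustrative: well inside cache, around cache-sized, and beyond.
    for n in (1 << 12, 1 << 16, 1 << 20):
        chain = build_chain(n)
        t0 = time.perf_counter()
        chase(chain, n)
        dt = time.perf_counter() - t0
        print(f"{n:>8} elements: {dt / n * 1e9:.1f} ns/step")
```

On real hardware the same idea, written in C with the array size swept past the L2 size, is how the well-known latency numbers (like the ~75ns figures mentioned below) are typically obtained.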

But back to K7 vs K8: I've looked around to see what the
core improvements are and outside of the improvements in my
earlier post, all I've found to add to it are:

- the pipeline's front-end instruction fetch and decode logic is refined to deliver a greater degree of instruction packing from the decoders to the execution pipe schedulers
- larger TLB for L1 and L2

and the second one is arguably memory related. Overall, the
list of non-memory-related improvements from K7 to K8 looks
very small to me.

pgerassi

05/10/06 8:48 PM

#4857 RE: wbmw #4828

Wbmw:

You do not know what the percentages are for Conroe, Merom and Woodcrest as to how much can be attributed to cache. For K8, however, the cache size benefits are well known, since versions shipped with cache totals of 256KB, 384KB, 640KB and 1,152KB, which show what cache size buys on a given workload. There are also many tests showing how much latency improves with DDR at various memory timings, and further tests showing the benefit of doubling K8's memory bandwidth.

Much of this testing has shown that on single-core desktop and mobile systems the biggest gains come from latency reductions, followed by cache size and then bandwidth. On server-type loads, latency reductions still give the biggest benefit, followed by bandwidth and then cache size.

Because the IMC pulls memory traffic off the HTT link in single-socket systems, comparisons with Intel's FSB-based parts become far more problematic. HTT only has to carry I/O traffic, which doesn't load its bandwidth all that much, while the FSB must carry both memory and I/O traffic, a far greater load. That hurts latency, because the FSB only goes one way at a time and prior transactions must complete before the next can start, whereas HTT can read and write simultaneously. In the single-socket case, K8's latency is almost unaffected by load.
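A quick bit of traffic accounting shows the asymmetry (the GB/s figures below are hypothetical placeholders of my own, chosen only to make the point, not measured numbers for any specific part):

```python
def bus_utilization(loads_gbps, capacity_gbps):
    """Fraction of a link's capacity consumed by the listed traffic streams."""
    return sum(loads_gbps) / capacity_gbps

if __name__ == "__main__":
    MEM_GBPS = 5.0   # hypothetical steady-state memory traffic
    IO_GBPS = 0.5    # hypothetical steady-state I/O traffic
    # FSB: one shared bus must carry memory *and* I/O traffic.
    print(f"FSB utilization: {bus_utilization([MEM_GBPS, IO_GBPS], 8.5):.0%}")
    # K8: memory traffic goes straight to the IMC; HTT carries only the I/O.
    print(f"HTT utilization: {bus_utilization([IO_GBPS], 8.0):.0%}")
```

Whatever the exact numbers, the structural point stands: the FSB's utilization includes the memory stream, so it sits far closer to saturation than an HTT link carrying I/O alone.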

When multiple sockets are considered, NGA's FSB becomes a severe bottleneck and the latency degradation accelerates. The 75ns synthetic tests measure a bus doing nothing but memory accesses, with no I/O and no accesses from the other core delaying the FSB from processing the request. Real contention could push the latency much higher than the synthetic tests indicate.

I find it telling that practically no tests have been publicly released showing NGA's ability on heavy power-user or server-type workloads. Most of the multicore testing seems to be single-code, multiple-data work where prefetching is simple and straightforward, and where the two cores either don't interfere much with each other or one sits idle. And 64-bit testing seems to be completely missing from the public eye.

All this basically says that NGA needs much wider public testing before anyone can foresee how well it performs vis-à-vis the well-known K8. We will see how it does when put through the public meat grinder.

As to your dissing of the IMC and HTT versus the FSB: it seems to miss many advantages of that combination, which come from the synergies of the two working in concert with many other features. You also seem to forget that FSB turnarounds flush the bus pipeline, since the bus can't be turned around until all ongoing transactions are finished. That overhead seems to have slipped from your mind.

Pete