
Re: None

Thursday, 10/08/2015 12:17:57 PM

Post# of 151657
Interesting paper comparing the cache and memory architectures of Sandy Bridge
and Bulldozer, along with measured performance.

http://www.noahmendelsohn.com/COMP40Slides/2014%20Memory%20Performance%20Paper.pdf

From their conclusions section:

We find the AMD Bulldozer architecture with its module concept
and two independent dies per socket to be much more complex
than Intel’s Sandy Bridge design, creating a vast amount of different
latency and bandwidth numbers. While latency figures are
mostly in line with our expectations, several observed bandwidths
are surprisingly low. The accumulated L3 cache bandwidth of a
full Bulldozer die (8 cores) is close to the L3 bandwidth of a single
Sandy Bridge core. The L3 cache bandwidth also scales better with
the core count on the Intel system. Although AMD’s L2 cache is
very large, its performance is only on par with Intel’s L3 cache in
a per-core comparison. The accumulated L3 bandwidth of a Bulldozer
socket exceeds the main memory bandwidth only by a factor
of two, compared to more than a factor of five on the Intel system.
This is even more noteworthy knowing that the Sandy Bridge system
is also superior in terms of main memory bandwidth per socket.
While both interconnect technologies fail to fully utilize the memory
bandwidth of other NUMA nodes, the HyperTransport results
are much more disappointing. The transfer rate between the sockets
in the Intel system is four times higher than the transfer rate
between the two dies within the AMD processor and more than ten
times more effective than some of the two-hop connections in the
AMD topology. Finally, on-die latencies are much better on Sandy
Bridge, mostly due to the inclusive L3 cache design.

Overall, we attribute a significant portion of Intel’s current advantages
regarding application-level per-socket performance to the
differences in the memory hierarchy. The L3 cache provides a
high bandwidth per core that also scales linearly with the amount
of cores. The QuickPath interconnect also provides a relatively
high bandwidth for remote memory accesses. In contrast, AMD’s
memory subsystem severely limits the achievable processing power
of the compute units in memory-intensive applications. Furthermore,
parallel programs need to be exceedingly NUMA-conform to
avoid being limited by the unexpectedly low HyperTransport performance
for certain connections.
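
For anyone who wants to get a feel for these numbers on their own box, here is a minimal sketch of a single-threaded read-bandwidth microbenchmark in C. To be clear, this is my own toy example, not the authors' benchmark suite; the buffer size, repetition count, and timing approach are assumptions. Vary the working-set size to move between L2, L3, and main memory.

/* Minimal read-bandwidth sketch (my own, not the paper's benchmark kit).
 * Compile e.g.: gcc -O2 -o bw bw.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    /* 8 MiB working set as an assumed starting point; shrink it to stay in
     * L2/L3 or grow it to spill into main memory. */
    const size_t bytes = 8u << 20;
    const size_t n = bytes / sizeof(uint64_t);
    const int reps = 200;

    uint64_t *buf = malloc(bytes);
    if (!buf) return 1;
    for (size_t i = 0; i < n; i++) buf[i] = i;   /* touch pages (first-touch) */

    volatile uint64_t sink = 0;
    double t0 = now_sec();
    for (int r = 0; r < reps; r++) {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++) sum += buf[i];   /* sequential reads */
        sink += sum;
    }
    double t1 = now_sec();

    double gb = (double)bytes * reps / 1e9;
    printf("read bandwidth: %.2f GB/s (sink=%llu)\n",
           gb / (t1 - t0), (unsigned long long)sink);
    free(buf);
    return 0;
}

On Linux you could pin it with numactl --cpunodebind and --membind to local vs. remote nodes to see the kind of NUMA penalty the authors attribute to the interconnects, though a single-threaded loop like this will only hint at the aggregate-bandwidth scaling differences they measured.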