Doubling the width of the 3 HTT channels would take 150 pins, out of the 267-pin delta between Sockets F and 940.
A single memory channel takes ~190 pins (the delta going from Socket 754 to Socket 940).
I'm very curious about how much latency the coherency part of cHTT adds to a memory request between Opterons. We know that a fully coherent 1-hop request adds ~20% to latency. Part of that is due to the coherency protocol.
How much of that 20% would go away if an HTT cave were used as an optimized memory interface that doesn't require the coherency protocol? Let's make it a full 32-bit HTT interface running at the full 1.4GHz HTT 2.0 spec. Aggregate bandwidth of the HTT link is 22+GB/s bidirectional vs. the 3.2GB/s unidirectional that a PC3200 (DDR400) module achieves. That gives the HTT link vast overkill in bandwidth, yet it needs only 75% of the pins of a normal memory interface. Also, overall latency would be UNAFFECTED, as reads and writes could be performed simultaneously between CPU and Memory Cave (which would be capable of buffering requests and performing them optimally on the memory).
I only used a 32-bit interface for ultimate long-term expansion capability. 16-bit 11GB/s bidir interfaces would still leave 100% overhead on the table compared to today's solutions, with only 40% of the pincount.
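For anyone who wants to check the bandwidth numbers, here is a rough sketch of the arithmetic. It assumes HTT moves two transfers per clock (double data rate) and that the quoted width applies in each direction; the function and variable names are just mine for illustration.

```python
def htt_bandwidth_gbs(clock_ghz, width_bits):
    """Return (per-direction, bidirectional) GB/s for one HTT link."""
    # Two transfers per clock, width_bits / 8 bytes per transfer.
    per_direction = clock_ghz * 2 * width_bits / 8
    return per_direction, 2 * per_direction

MODULE_GBS = 3.2  # the 3.2GB/s memory module figure from the post

one_way_32, both_ways_32 = htt_bandwidth_gbs(1.4, 32)
one_way_16, both_ways_16 = htt_bandwidth_gbs(1.4, 16)

print(f"32-bit link: {one_way_32:.1f} GB/s each way, {both_ways_32:.1f} GB/s bidir")
print(f"16-bit link: {one_way_16:.1f} GB/s each way, {both_ways_16:.1f} GB/s bidir")
print(f"32-bit bidir vs one module: {both_ways_32 / MODULE_GBS:.1f}x")
```

That lines up with the "22+GB/s" and "11GB/s" figures above.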
So let's implement it: seven 32-bit HTT interfaces cost 1008 pins. With an "apples to apples" configuration of three links used for IO and cpu-cpu interconnect, and 4 dedicated to memory, the effect is to:
A) Double the memory bandwidth with at worst a 20% latency hit (but most likely significantly less than that) and optimal memory throughput;
B) Increase memory bandwidth headroom by 700%;
C) Decouple memory technology from the processor, simplifying the core; (I posit that 7 HTT interfaces is simpler than 3 HTT + 2 memory controllers.)
D) Increase IO/interprocessor bandwidth by 280% while decreasing latency.
E) Make memory hot-swap trivial.
F) Increase flexibility (7 homogeneous HTT interfaces could be divvied up between IO/interprocessor/memory as needed).
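Here is one way the pin and bandwidth figures in the list can be reproduced. The two baselines are assumptions of mine, not something stated above: today's memory taken as dual-channel at 3.2GB/s per channel, and today's three HTT links taken as 16-bit at 1GHz (8GB/s bidirectional each). Other baselines would give different percentages.

```python
HTT32_BIDIR_GBS = 22.4   # one 32-bit 1.4GHz link, bidirectional (from the post)
TOTAL_LINKS = 7
TOTAL_PINS = 1008        # from the post

pins_per_link = TOTAL_PINS // TOTAL_LINKS

# Assumed baselines (not stated in the post).
BASE_MEM_GBS = 2 * 3.2   # dual-channel memory at 3.2GB/s per channel
BASE_HTT_GBS = 3 * 8.0   # three 16-bit 1GHz HTT links, bidirectional

mem_links, io_links = 4, 3

doubled_mem_gbs = 2 * BASE_MEM_GBS                 # item A: doubled bandwidth
mem_link_capacity = mem_links * HTT32_BIDIR_GBS    # what the 4 links could carry
headroom_ratio = mem_link_capacity / doubled_mem_gbs            # item B
io_ratio = io_links * HTT32_BIDIR_GBS / BASE_HTT_GBS            # item D

print(f"pins per 32-bit link: {pins_per_link}")
print(f"memory link capacity vs doubled memory: {headroom_ratio:.1f}x")
print(f"IO/interprocessor bandwidth vs baseline: {io_ratio:.1f}x")
```

Under those assumptions the ratios come out to 7x and 2.8x, which is how I read the 700% and 280% figures.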
The cost is:
A) The memory caves.
My gut instinct is that PCB complexity is a wash, with a simpler CPU block linked to more complex memory blocks.
This is a platform that is PRIMED for quad core, and could last 5 years.
Additional thoughts: Let's repartition the use of the 7 HTT links:
4 for interprocessor,
2 for memory,
1 for IO
Given the enormous overhead available on the memory links, why not put dual memory controllers on the caves? You're still doubling the memory bandwidth with minimal effect on latency, but now 8-socket becomes the sweet-spot (cube topology) and each corner of the cube has a dedicated 22GB/sec IO link.
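To put a number on the cube idea, here is a small sketch of hop counts in an 8-socket cube. It assumes a plain cube where each socket talks to 3 neighbours, so the fourth interprocessor link is left unused here; that is my simplification. Sockets are labelled 0-7 and two sockets are neighbours when their labels differ in exactly one bit.

```python
from itertools import combinations

SOCKETS = range(8)  # 3-bit labels, one per corner of the cube

def hops(a, b):
    """Hops between two cube corners = number of differing label bits."""
    return bin(a ^ b).count("1")

pair_hops = [hops(a, b) for a, b in combinations(SOCKETS, 2)]

direct_links = sum(1 for h in pair_hops if h == 1)  # edges of the cube
max_hops = max(pair_hops)                           # corner to opposite corner
avg_hops = sum(pair_hops) / len(pair_hops)          # over all socket pairs

print(f"links: {direct_links}, worst case: {max_hops} hops, average: {avg_hops:.2f} hops")
```

So the worst case is 3 hops, and how much that hurts depends on the per-hop latency question raised at the top.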
How's that for a theory?
fpg