
Dan3

03/30/03 2:15 PM

#1624 RE: Elmer Phud #1623

Re: What this means is that the bandwidth claims of aHT is somewhat misleading.

LOL!! You must be thinking in the context of the Intel products you work with. What error rates have you been seeing in Intel chipset transfers that lead you to think a CRC check would materially impact the bandwidth, 5%? 20%?

A CRC error on an AMD mainboard, apparently in contrast to what you see in the Intel products you work with, would be a one-in-many-trillions occurrence (caused by an errant cosmic ray, for example). It would not impact the specified bandwidth.





Tenchu

03/30/03 3:09 PM

#1625 RE: Elmer Phud #1623

EP, <CRCs must be attached to the data packet and therefore always consume bandwidth even when there is no data corruption. Corrupted data must be retransmitted, further consuming bandwidth.>

First of all, data corruption really shouldn't happen all that often. Maybe once per week, and that's if you add up all of the servers in a cluster. That's why bandwidth is irrelevant when it comes to retransmitting data. The only real cost of retransmission vs. on-the-fly error correction is that the transmitter must keep a sizable FIFO buffer of all its transmitted data until the receiver sends back a "CRC OK" packet.

Second, CRC typically has a higher ratio of data bits to check bits than ECC. The reason CRC consumes bandwidth whereas ECC doesn't is that buses with ECC add extra data lines for the check bits.

And third, ECC doesn't make sense for a variable-width packet bus like HyperTransport. The reasons why get rather complicated, but suffice it to say that CRC is enough.

Tenchu

P.S. - I see Dan has already used inaccuracies to correct yours. As usual. ;-)

UpNDown

03/30/03 8:24 PM

#1633 RE: Elmer Phud #1623

Elmer, re: ECC can correct an error

We have to keep the bus usages separate. aHT is used for general transfers; I assume you don't think we should be sending out ECC on all our video writes. cHT (coherent HT) is used for cache probes and memory transfers, and it might be more useful to use ECC for those.

ECC can correct some errors, provided you're willing to assume that there hasn't been a burst of errors that falsely appears to be a single-bit error. CRC can detect a higher percentage of errors.
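To make the burst-error point concrete, here is a minimal sketch. It uses a toy Hamming(7,4) code as a stand-in for ECC and Python's `zlib.crc32` as a stand-in for the link CRC; neither is the code actually used on memory buses or HyperTransport links, and the helper names are made up for illustration. Real ECC memory typically uses wider SECDED codes that do flag double-bit errors, so there the same failure mode needs three or more flipped bits.

```python
import zlib

def hamming74_encode(data4):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    code = [0] * 8                      # index 0 unused so indices match positions
    code[3], code[5], code[6], code[7] = data4
    syndrome = 0
    for pos in (3, 5, 6, 7):
        if code[pos]:
            syndrome ^= pos
    for p in (1, 2, 4):                 # choose parity bits so the syndrome becomes zero
        code[p] = 1 if syndrome & p else 0
    return code[1:]

def hamming74_decode(code7):
    """Return (data bits, syndrome); flips one bit if the syndrome is nonzero."""
    code = [0] + list(code7)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    if syndrome:
        code[syndrome] ^= 1             # decoder assumes a single-bit error
    return [code[3], code[5], code[6], code[7]], syndrome

def flip(code7, positions):
    """Flip the bits at the given 1-indexed positions."""
    out = list(code7)
    for pos in positions:
        out[pos - 1] ^= 1
    return out

def crc_detects(message, bit_positions):
    """True if CRC-32 changes after flipping the given bit offsets in message."""
    corrupted = bytearray(message)
    for bit in bit_positions:
        corrupted[bit // 8] ^= 1 << (bit % 8)
    return zlib.crc32(bytes(corrupted)) != zlib.crc32(message)

data = [1, 0, 1, 1]
word = hamming74_encode(data)
one_flip, _ = hamming74_decode(flip(word, [3]))      # corrected back to data
two_flips, _ = hamming74_decode(flip(word, [3, 5]))  # "corrected" into wrong data
print(one_flip == data, two_flips == data)
print(crc_detects(b"cache line payload", [10, 11]))  # 2-bit burst is caught
```

With two flipped bits the toy decoder confidently repairs the wrong bit, while the CRC simply reports a mismatch and leaves recovery to a retransmit.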

A mechanism for retransmitting the data is desirable whether using ECC or CRC. For memory reads, we use ECC because the memory is a stupid device and must reconstruct the data. If it can't, we still want to be able to figure out what it should have been. But all aHT transfers are between intelligent controllers, so there is no reason to go with ECC when a retransmit would be required anyway if the data could not be reconstructed with the ECC bits delivered. In other words, even if we had ECC coding for aHT or cHT transfers, we'd still want CRC and retransmit as well in case the ECC wasn't sufficient.

You say "On a shared bus ECC is in parallel with the data". I believe the ECC bits only get as far as the memory controller; they don't make it to the system bus (FSB). For Opteron, the ECC bits get to the on-chip memory controller. From then on, if the memory line must be transmitted to another Opteron, cHT is used. So real memory errors are corrected on-the-fly by the host Opteron. It's only when another CPU requests the data that the cHT is used.

The statement: "Corrupted data must be retransmitted, further consuming bandwidth" is just FUD. We're talking about something that might happen less than once in a million, a billion, who cares how many, transfers, so any increase in required bandwidth would be infinitesimal.
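A rough way to put a number on "infinitesimal": if each transfer independently fails its CRC check with probability p and is resent until it gets through, the expected number of sends per transfer is 1/(1-p), so the extra bandwidth is p/(1-p). The sketch below just evaluates that; the independence assumption and the function names are mine, not anything from the HT spec.

```python
def expected_transmissions(p_error):
    """Expected sends per transfer when each send fails independently with p_error."""
    if not 0.0 <= p_error < 1.0:
        raise ValueError("p_error must be in [0, 1)")
    return 1.0 / (1.0 - p_error)

def retransmit_overhead(p_error):
    """Fraction of extra bandwidth spent on retransmissions."""
    return expected_transmissions(p_error) - 1.0

# Illustrative error rates only; none of these are measured figures.
for p in (1e-6, 1e-9, 1e-12):
    print(f"p = {p:.0e}: extra bandwidth = {retransmit_overhead(p):.3e}")
```

For small p the overhead is essentially p itself, so a one-in-a-billion error rate costs about a billionth of the link bandwidth.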

Finally, we get to "CRCs must be attached to the data packet and therefore always consume bandwidth even when there is no data corruption." Yes, and I don't think anybody would want it any other way. For the bulk transfers -- disk blocks, data acquisition devices, etc. -- that are needed to saturate the aHT links, the percentage overhead decreases as the size of the blocks increases, so the bandwidth effect is also minimal.
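The block-size argument is simple arithmetic. As a sketch, assume a fixed 4-byte CRC appended to each block; that size and framing are illustrative assumptions, not the actual HyperTransport CRC scheme, and `crc_overhead_fraction` is a made-up helper name.

```python
def crc_overhead_fraction(payload_bytes, crc_bytes=4):
    """Share of link traffic spent on the CRC for one block (assumed fixed-size CRC)."""
    if payload_bytes <= 0:
        raise ValueError("payload_bytes must be positive")
    return crc_bytes / (payload_bytes + crc_bytes)

# Overhead shrinks as blocks grow.
for size in (64, 512, 4096):
    print(f"{size:5d}-byte block: {crc_overhead_fraction(size):.2%} CRC overhead")
```

Under these assumptions a 64-byte block spends about 5.9% of its traffic on the CRC, while a 4096-byte disk block spends under 0.1%.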

What might be interesting to discuss is the effect of waiting for a CRC check on a full cache-line transfer between processors. The Opteron can no longer use a "most needed data first" optimization and must wait for the whole cache line plus CRC to arrive. This will show up as an increase in memory latency when accessed from other Opterons.
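As a back-of-the-envelope sketch of that latency penalty, taking the premise above at face value (the receiver holds the line until the CRC has been checked): compare the time until the whole line plus CRC has arrived with the time until just the first, most-needed chunk has arrived. Every number here is an assumption for illustration: a 64-byte cache line, an 8-byte critical chunk, a 4-byte CRC, and a link moving 3.2 bytes per nanosecond (3.2 GB/s in one direction).

```python
def transfer_time_ns(num_bytes, link_bytes_per_ns):
    """Time to move num_bytes across the link, ignoring protocol headers."""
    return num_bytes / link_bytes_per_ns

def added_latency_ns(line_bytes=64, first_chunk_bytes=8, crc_bytes=4,
                     link_bytes_per_ns=3.2):
    """Extra wait when data is usable only after the full line plus CRC arrives.

    All defaults are illustrative assumptions, not measured Opteron figures.
    """
    wait_for_all = transfer_time_ns(line_bytes + crc_bytes, link_bytes_per_ns)
    critical_first = transfer_time_ns(first_chunk_bytes, link_bytes_per_ns)
    return wait_for_all - critical_first

print(f"added latency: {added_latency_ns():.2f} ns")
```

With these assumed numbers the penalty is on the order of 19 ns per remote cache-line read; a narrower or slower link makes it proportionally larger.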

[Edit: I see Tenchu has taken up the banner here.]