
Re: chipguy post# 63366

Wednesday, October 05, 2005 7:36:24 PM

Post# of 97570
Chipguy, re: Well, yes and no. See Hans de Vries' dissection of the Prescott die: it's two 32b ALUs working in conjunction for 64b calculations.

Your reply: If that was true it would have half the throughput for 64 bit integer operations as it does for 32 bit operations. I have not seen any indication of that in Intel's x86 optimization guide.

Here's how I understood it: throughput is the same with 64b as with 32b. The double-frequency 32b ALUs work in conjunction with each other. The carry bit (the "reserve bit", as we call it in Holland; hope my translation works OK; it's the 33rd bit) of the lower 32b add in the first unit is passed on to the second ALU, which calculates the higher 32b of the result. The full 64b result is available in 50% more time than it takes for a 32b add. Throughput is the same, however, because a new 64b add can be started every half clock (same as for 32b adds). In other words, 64b adds can be started back to back every half clock, just like 32b adds. Note: I took an add as an example only; the same goes for other simple INT calculations.
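To make the carry hand-off concrete, here is a minimal sketch in C of the idea as I understand it (my own illustration, not Intel's hardware and not Hans de Vries' work): a 64b add split into two dependent 32b adds, with the carry-out of the low half fed into the high half.

#include <stdint.h>
#include <stdio.h>

/* Illustration only: a 64-bit add performed as two 32-bit halves,
   the way the two double-pumped ALUs would divide the work.
   The low-half carry-out (the "33rd bit") feeds the high half. */
static uint64_t staggered_add64(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a,         b_lo = (uint32_t)b;
    uint32_t a_hi = (uint32_t)(a >> 32), b_hi = (uint32_t)(b >> 32);

    /* First half-clock: ALU 0 adds the low 32 bits. */
    uint32_t sum_lo = a_lo + b_lo;
    uint32_t carry  = (sum_lo < a_lo);   /* carry-out of bit 31 */

    /* Next half-clock: ALU 1 adds the high 32 bits plus the carry. */
    uint32_t sum_hi = a_hi + b_hi + carry;

    return ((uint64_t)sum_hi << 32) | sum_lo;
}

int main(void)
{
    uint64_t a = 0x00000000FFFFFFFFULL;  /* forces a carry into the high half */
    uint64_t b = 0x0000000000000001ULL;
    printf("%016llx\n", (unsigned long long)staggered_add64(a, b));
    return 0;
}

The point is that the high half depends only on a single carry bit from the low half, which is why it can start one half-clock later without preventing a new independent add from entering right behind it.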

Hope my words make sense to you, as I'm hardly an expert. This is how I interpreted what I read from Hans de Vries. I fully believe his analysis, as to my knowledge it has not been contradicted once. I did keep an eye out for contradictions; they never appeared.

Regards,

Rink

From the links I provided:

Second integer core for 64 bit processing (not for multithreading)

It is as good as sure that the second 32 bit core is exclusively used for 64 bit processing, in a way similar to the good old bit slices. ... What makes this possible is that the second core is limited mainly to additive and logic functions. A 64 bit staggered addition takes a total of four 1/2 cycles, but you can start two of them back to back at 1/2 cycle intervals. The higher part of the address is only used several cycles later, to check the address tags against the TLB entries, and not to access the data cache itself. What will increase by one cycle is the latency from an ALU instruction to a normal-speed integer instruction: this delay grows from 2 to 3 cycles. One extra pipeline stage is needed as well, resulting in a minor increase in the branch misprediction penalty.
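Putting the quoted numbers in one place, here is a toy schedule in C (my own illustration of the figures above, not from the article): each staggered 64b add occupies four half-cycles from issue of the low half to the full result, yet a new add can enter every half-cycle, so latency grows while throughput does not.

#include <stdio.h>

/* Toy schedule of back-to-back staggered 64b adds, one issued per
   half-clock. Numbers follow the quote: four half-cycles from
   low-half issue to the full 64b result being available. */
int main(void)
{
    const int adds = 4;          /* four back-to-back 64b adds */
    const int result_delay = 4;  /* half-cycles until full result */
    for (int i = 0; i < adds; i++)
        printf("add %d: issued at half-cycle %d, 64b result at half-cycle %d\n",
               i, i, i + result_delay);
    return 0;
}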

