The Duke of URL

03/13/06 12:18 PM

#2894 RE: wbmw #2892

Pete is like a guy about to be run over by a tank, screaming that he doesn't like the tread pattern. :))

pgerassi

03/13/06 12:58 PM

#2896 RE: wbmw #2892

Wbmw:

Re: NGMA still has parts of P3 in it. The 411 decoder, now 4111, still can't decode more than one complex instruction per cycle. And if the next instruction is simple, it can't even decode 1 complex instruction in a given cycle. K8 can do three in any given cycle. It makes it an all-around performer on widely varying code.

Wrong again, Pete. AMD essentially has 3 simple decoders. Any "complex" instructions go through the vector path and the micro-code sequencer comes up with a uop equivalent, albeit at a cost to performance. You don't even know how this works, do you?


Wrong again, Wbmw! Check out this image from Hans de Vries: http://www.chip-architect.com/news/Opteron_1600x1200.jpg

And look over this: http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html

Three decoders, each of which can decode a vector path instruction. The trouble is that the next stage can handle only three uop pairs. So changing some instructions from complex to double or direct increases the throughput. NGMA only allows 4 uops. K8 has 3 uop pairs, each with an ALU/FPU and an AGU operation. And in the example I made, 1 simple instruction followed by a complex instruction, AMD can do it in one cycle, but NGMA takes two. Take two complexes: one makes an ALU/AGU uop pair followed by an ALU/NOP uop, and the other makes a NOP/AGU uop pair followed by an ALU/AGU uop pair. Those are then fused into three ALU/AGU uop pairs. NGMA still takes two cycles to do that decode to K8's one. So all in all, NGMA's decoder can be slower in many cases, a little faster in a tiny fraction of cases, and is likely to be slower overall.

We still don't have nearly as many details for NGMA as we do for K8. The block diagram may give us clues as to how they did it, but the devil is in the details. NGMA can do 4 wide decoding, but only if the first instruction is a complex or simple one followed by three simple ones. K8 can decode three complex instructions, but can only send three uop pairs per cycle to scheduling. Of course we don't have details for K8L, or even what tweaks have been added to K8F.
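A toy model of the decode rules as stated above may make the comparison easier to follow. It is only a sketch of the claims in this post, not measured hardware behavior. The function names and the "S"/"C" stream encoding (simple/complex) are made up for illustration, and the three-wide model deliberately ignores the three-uop-pair limit of the next stage.

```python
from math import ceil


def cycles_4111(stream: str) -> int:
    """Decode cycles for a 4-1-1-1 decoder as described in the post.

    Assumption: slot 0 accepts a simple ("S") or complex ("C") instruction,
    slots 1-3 accept only simple ones, and decoding is strictly in order.
    """
    cycles = 0
    i = 0
    while i < len(stream):
        cycles += 1
        i += 1  # slot 0 takes whatever comes next
        for _ in range(3):  # slots 1-3 take simple instructions only
            if i < len(stream) and stream[i] == "S":
                i += 1
            else:
                break  # a complex instruction waits for slot 0 next cycle
    return cycles


def cycles_three_wide(stream: str) -> int:
    """Decode cycles for three symmetric decoders as described in the post.

    Assumption: any decoder accepts any instruction, and the downstream
    limit of three uop pairs per cycle is ignored.
    """
    return ceil(len(stream) / 3)


if __name__ == "__main__":
    for stream in ["SC", "CC", "CSSS", "SSSS"]:
        print(stream, cycles_4111(stream), cycles_three_wide(stream))
```

Under these assumptions, a simple instruction followed by a complex one ("SC") costs the 4-1-1-1 model two cycles and the three-wide model one, while a complex instruction followed by three simple ones ("CSSS") goes the other way.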

The proof, of course, is actual testing without the restrictions of NDAs, and the volumes seen at retail. Having a few Conroes at 2.66GHz does nothing if the bulk are at 2GHz. You can always cherry-pick one (or a few) for benchmark fests. But if Intel can only make hundreds at 2.66GHz and AMD can make hundreds of thousands of 3+GHz K8Fs, Intel will not have the performance crown no matter what paid reviewers say.

Pete

CombJelly

03/13/06 3:16 PM

#2902 RE: wbmw #2892

"AMD essentially has 3 simple decoders. Any "complex" instructions go through the vector path and the micro-code sequencer comes up with a uop equivalent, albeit at a cost to performance."

If true, then this isn't a problem. Using multiple, simple instructions instead of the complex instruction equivalent has been the rule since the 486 days. So the available compilers are going to do this.