InvestorsHub

DDB

10/18/04 10:45 AM

#46001 RE: dacaw #46000

dacaw - I had similar observations while optimizing Prime95. Such code depends on how the macro-ops produced by decoding the x86 code are grouped together, and on whether enough low-latency (reg, reg) variants of those instructions are available. Pipeline stalls can arise when instructions bound to just one of the three FP pipes significantly outnumber the rest: 20 FADD-pipe ops mixed with 10 FMUL-pipe ops, for example, will be packed into 20 groups scheduled to the FPU reservation station, 10 of them with an empty FMUL slot. With MMX code this should be somewhat easier, because many MMX ops can go through either the FADD or the FMUL pipe.
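The grouping effect described above can be sketched with a toy model. This is only an illustration of the counting argument (one FADD-pipe slot and one FMUL-pipe slot per group), not a model of the real K7 scheduler, and the function name is made up for this sketch:

```python
def schedule_groups(fadd_ops, fmul_ops, flexible_ops=0):
    """Toy model: each group holds at most one FADD-pipe op and one
    FMUL-pipe op. Flexible ops (e.g. MMX ops that can use either pipe)
    fill whichever slot is free."""
    # Dedicated ops pair up first; each pair shares one group.
    paired = min(fadd_ops, fmul_ops)
    leftover = max(fadd_ops, fmul_ops) - paired
    # Flexible ops fill the empty slots of leftover groups,
    # then pair among themselves, two per group.
    filled = min(flexible_ops, leftover)
    flexible_ops -= filled
    return paired + leftover + (flexible_ops + 1) // 2

# The example from the post: 20 FADD-pipe ops + 10 FMUL-pipe ops
# need 20 groups, 10 of them with an empty FMUL slot.
print(schedule_groups(20, 10))      # 20 groups
# If 10 more ops are also FADD-bound, the count grows to 30 groups;
# if they are flexible (MMX-style), they fill the empty slots instead.
print(schedule_groups(30, 10))      # 30 groups
print(schedule_groups(20, 10, 10))  # still 20 groups
```

The contrast between the last two calls is the point: a pipe-balanced (or pipe-flexible) instruction mix packs the same work into fewer scheduling groups.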



chipguy

10/18/04 11:10 AM

#46003 RE: dacaw #46000

"Recall that when I was developing the DCT code I ran it through AMD's CodeAnalyst (then in beta form). My DCT did a lot of its work in floating point on normal 8x8 macroblocks. MMX was used a lot too - I did all I could to make it fast but above all accurate."

If your code "did a lot of its work" in FP, then why do
you say "MMX was used a lot too"?

mmoy

10/18/04 11:43 AM

#46007 RE: dacaw #46000

I didn't know such a tool existed. I've compared IDCT
routines in FP and scaled integer and found that scaled
integer performs much better on SSE and SSE2 machines. You
can lose some accuracy, but this is for displaying JPEGs
in Mozilla, and I haven't heard any reports of degradation
from the code, nor have I seen any myself.
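The scaled-integer idea mentioned here can be sketched briefly: DCT cosine constants are pre-scaled to integers so the inner loop needs only integer multiplies and shifts (which map well to SSE2), at the cost of a small rounding error. The constant names and the 13-bit scale below are illustrative assumptions, not taken from the actual Mozilla JPEG code:

```python
import math

SCALE_BITS = 13  # illustrative fixed-point precision
# Pre-scale one DCT cosine constant to an integer.
C2 = round(math.cos(2 * math.pi / 16) * (1 << SCALE_BITS))

def scaled_mul(x, const):
    """Multiply sample x by a pre-scaled constant, then round the
    product back down by SCALE_BITS (add half an LSB before shifting)."""
    return (x * const + (1 << (SCALE_BITS - 1))) >> SCALE_BITS

# Compare against the straightforward FP multiply.
exact = 100 * math.cos(2 * math.pi / 16)
approx = scaled_mul(100, C2)
print(exact, approx)  # the integer result is within one unit of the FP one
```

This is the accuracy trade-off in miniature: the error per multiply stays below one unit at this scale, but it can accumulate across the butterfly stages of a full IDCT, which is why accuracy reports matter for such code.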

I'll have to take a look at this tool. I've done a lot of
work on it, but I know improvements are still possible, and
I didn't focus much on pipeline behavior.

Does the tool recommend code changes or reorder instructions
for you, or do you tweak, test, and repeat?