New HT 2.0 article
Just about to read it:
http://www.devx.com/amd/Article/26985
Prescott 2M Max Tc
Note too that the max allowed case temperature (Tcase) for Prescott 2M w/ EM64T has dropped from 72.7C to 70.7C.
Obviously the design is "tighter". Dropping the max Tcase by 2C makes the cooling solution even harder than it already is, and it's already tough enough. What a nightmare.
AMD more reliable than Intel
That doyen of hardware sites, known for its impartial view on intel v. AMD (not!), is running stress tests on systems.
Seems that the intel system leaves a bit to be desired:
http://www20.tomshardware.com/stresstest/index.html
They'll probably start dropping coffee on the AMD system any time now. Anything to keep their intel ad revenue.
Semiconductor Breakthrough: Processor 24 times faster
Gotta love these journalists!
On Google news no less.
http://www.earthtimes.org/articles/show/852.html
Scc on SIMD
Often you can avoid setting condition codes by using xors, mins and maxes. What are you trying to do? I generally find that the video code I have worked on can avoid any jcc.
Hi mmoy - nice assembler optimization
As promised, here's a nice trick you can do with assembler - it's one of the best optimizations I have come across, yet it's not in any software optimization guide I've seen, certainly not AMD's.
Many loops have a general form:
top:
[do_loop_body]
[test_for_some_condition, if true break]
repeat_n_times
endloop:
This is generally encoded as a test, jcc, dec, jnz
But there is a better way if the test, jcc sets & tests the carry flag.
Note that a) ja jumps if CF==0 && ZF==0
and b) dec doesn't change the CF
so you can do this:
test_condition ; sets CF if break
dec reg ; doesn't change CF
ja top ; jumps only if neither CF nor ZF is set
Compared to some of the points in AMD's Software Optimization Guide this is a much bigger win than most. One of the big issues in the Guide is the number of jumps in a cache line; this trick often helps with that problem.
Enjoy!
Hi mmoy:
Took a wander over to moox and saw that you're active there.
I presume you profiled Firefox, what are the critical sections?
Do you do any assembler or is it all C stuff? If you do assembler I have some tricks you may be interested in.
All agreed here.
Yup, no question about it, everything is just hunky and there's not a cloud in the sky.
Intel's just like a bunch of happy beavers.
Whistle-while-you-work, oh yes, whistle while you work. .
Do-dah-de-dah, hmmmmm.
About credibility.
Does anyone else feel that the multitude of statements that intel's execution problems have been "solved" leaves just a teensy-weensy bit of disbelief?
I mean it was just a few weeks ago that Barrett exploded with rage in the widely circulated memo about intel really not doing their stuff.
What happened in those few short days? I mean, did the archangel Gabriel come down and anoint all the researchers? There's hardly been time for any results to shore up the "it's fixed" claims.
More likely a whole lot of foils have been adjusted, schedules pulled in on critical paths - whether it can be done or not.
Meanwhile there's a whole lot of senior foreign scientists doing a "Berners-Lee" (the inventor of the WWW, who left MIT for grayer climes on the other side of the pond). My neighbor told me last Friday that researchers were leaving Amgen in Thousand Oaks, CA in quite large numbers to return to their countries of birth. Something to do with the megalomania coming out of DC, I believe.
So it's pull in the schedule, dust off the resume, call mom in Bangalore and start making preps for the trip.
Well we'll see if intel's firing on "all 8 cylinders" won't we.
On chip memory controller.
Trouble is that intel has had a lot of trouble with memory controllers.
Let's hope they don't use the same circuitry as they did in the MTH.
It's a long way from foil to product. Another recall would be rather embarrassing.
"No serious developer uses a laptop"
Well I can tell you that you're wrong!
I need a 64-bit portable
If I'm developing software for AMD64 desktops or servers I need a 64-bit laptop on which to work, don't I?
It's not just gamers.
"directx on servers?"
DirectX is used for more than games.
Take a look at Sun's JRE (Java runtime). I just had an issue with a machine that threw a Kmode_Exception in D3DX. When you ran Java apps it would trigger the same exception.
I know there's a way to tell the Java runtime not to use DirectX, but the default is to use it.
I believe java has some use on servers, no?
This is just an example.
No DirectX on Itanic
Just going over the SDK notes for DirectX 9.0c and there's this line:
"The is no support for the IA64 bit platforms. "
Yeah, itanic takes over the world.
Now THAT was funny.
I actually guffawed! Thanks.
Nvidia SLI question
Can someone explain to me please why this Nvidia SLI (dual video card motherboard for one display) is such a great thing?
Surely it makes more sense just to make a video card with 2 GPUs?
Slightly OT: New Google service
This looks really good:
http://scholar.google.com/
Returns search results that are papers & publications.
Quote "Finally, OoO is of very little benefit for most FP intensive code."
Where do you come up with this BS?
I told you exactly my experience of writing for OOO and non-OOO procs. Do you actually read what's posted or do you just replay the party line? I stated, and I will state again, that coding for fp-intensive operations is much easier on the Athlon because the proc takes care of the micro-level optimization.
Non-OOO procs are a royal pain. IMHO they have no place in high-end machines because the results are so unpredictable.
One butterfly flap of the wings and the run time goes to hell.
. . . and a corollary
Thinking about it - since non-OOO fp procs are highly sensitive to micro-optimizations - it would seem reasonable to me that compiler writers would put lots of effort into the codegen for oft-used SPEC sequences.
But of course intel would never tweak benchmarks to favor their own processors, would they????
Effects of Compiler, #regs & OOO on fp performance
I've watched this piece of fud about hidden regs vs visible regs with some amusement.
With itanic the optimization is done by the compiler. Since there is no out-of-order facility in the proc there is little opportunity to optimize at run time.
I experienced this when I moved from the K6-III to Athlon in my heavily hand-tweaked assembler. The K6-III fp unit had no opportunity to do OOO execution on fp code, like the itanic. When you were in a complex section just changing the order of a couple of lines could result in pretty large speed changes. What a pain! You even had to put nops in to align code on boundaries.
In contrast Athlon optimizes the fp code at run time by moving the ops around as resources come available. Thus micro-optimization by the coder is pointless - you really don't see any difference by tweaking the odd line here or there - or sprinkling nops around.
It looks to me, from articles on the Athlon64, that the fp OOO has been improved quite a bit. Of course just having 16 SSE regs is the bee's knees.
Saying the itanic's registers are "better" because they are visible is just silly. They have to be visible or the proc can do nothing worthwhile with them. It's compile-time optimization that is the whole basis of the EPIC design. I'd rather have intelligent run-time hardware that maximises the resources available.
There are lots and lots of studies that analyze the benefit of increasing the # of regs. Of course it's diminishing returns. 16 seems optimal right now given software tech and hardware design.
In so many ways x86 is broken. AMD64 makes it worthwhile for the first time.
'It looks like the answer is "no".'
No, it looks like the answer is being debated.
Probably between the intel marketeers and the people who actually do things like CFD (as I used to, a lot).
My conclusion, FWIW, is that there's a lot more to running analysis than just spouting some magic number.
Infiniband, Pathscale & Clusters
Article on how Pathscale is using Hypertransport, incl. the new HTX connector, with Infiniband for cluster interconnects and the benefits therein.
http://www.devx.com/amd/Article/22534
SPEC, cache & memory
From the same site as the Opteron fp unit discussion, another article about the brouhaha over spec scores:
Quote: "The SPEC 2000 benchmarks are subject to much debate in the scientific community. Are they broken? Do they just depend on memory bandwidth? Do they fit entirely in the cache? "
http://www.chip-architect.com/news/2003_08_29_Cache_efficiency_for_SPEC2000.html
Note the comment in the final para:
"The memory footprint of the SPEC2000 benchmarks is less then 200 MByte to be able to run on systems with 256 MByte DRAM. Heavier applications using multiple Gigabyte structures are likely to see much greater degradations. AMD's distributed memory solution based on HyperTransfer links is likely to pay of in these cases. A four processor 2200 MHz Opteron may reach a similar SPEC2000_rate performance as a four way 1500 MHz Itanium 2 even though the latter has a much higher single processor score. Again, larger floating point memory footprints may skew the results even further. "
Lack of 90nm in channel
These amazing power consumption figures strongly suggest that:
a) 90nm chips are going to preferred customers (SUNW especially)
b) AMD wants to clear 130nm chips out of disties
That's why, for example, Reseller Mike sees few 90nm chips coming down the line.
Low power consumption
The low power consumption, as shown in the chip.de and tomshardware articles, bodes extremely well for dual cores next year.
AMD has got quite a process there to match its matchless design.
Large cache & real world problems.
I've done a lot of fluid dynamics. I've published (15 years ago now) in this field. When I started CFD 25 years ago it was really in its infancy.
You always start with a small number of steps (finite-difference based methods) or a small number of mesh points (finite element methods). The governing parameter is the speed of solution - it used to be the size of virtual memory, but that constraint is now relaxed on workstations.
You want to have really detailed modeling of the system - i.e. lots of points - but you always hit the limit of run time. It's no fun waiting all day for your solution only to find out it blew up.
So there is no magic "ideal" working set. More is always better if you have a fast enough system.
The specfp tests have a significant working set but it's a fixed set. Cynical manufacturers could make processors where all of the code+data fits on-chip. Of course when the real user then tries to extend the simulation, and the code+data spills into physical memory, he/she is going to be disappointed. There is another point where the spillage from physical to virtual (i.e. disk) kills you, but physical memory is cheap and relatively easily extended. On-chip cache is neither cheap nor easily extended.
That's why you don't take these specfp results only at face value. Take a trip to somewhere like comp.arch and see what people think about the itanic there. They will say exactly the same thing as me: these specfp results are all about false benchmarking.
Ho ho ho specfp & itanic
Just had to laugh.
I told you how itanic does well on specfp because of its large cache, a factor that does not extend to real world scenarios.
Now intel ups the cache to 9MB and it does really, really well - and you crow about it.
Thanks for making my point even more clear.
I hope intel just keeps pushing that boat out.
"Super platinum member"
I remember the "platinum" call. My wife went all gushy over that one (she answered the phone). I think that was about $350K. That was the day that AMD hit $90+.
Balance that against the day when AMD hit $3 and she sat on the bed sobbing as I got the margin call. Then she tried to hit me with a frying pan and told me I was stupid to believe in that @@*&$#@ AM whatever D company. Fortunately I had the cash to cover the call.
Super platinum is defined by a slight swagger in the walk, an ear-to-ear grin and an i-told-you-so expression. As this is potentially a family board I can't tell you about the member.
"AMD really isn't that much different than it was 6 months ago"
Yup, that's why the stock boards have been bursting with messages and why I was accumulating at 7.6, 9.3, 12.02 (a LOT), 16.
Feels pretty good today. Waiting for my call from Schwab - you know the one "you are now a super-platinum member, congratulations"
When to sell.
No, now is NOT the time to exit; the runup has just started. A lot of people look to TA for an exit point. That's well and good, but the fact here is the momentum. Just take a look at the 3 month chart. It's awesome. There will be some settling along the way but this puppy won't be fully valued until its cap > $20bn, and that's a long way from here. Of course by the time it reaches $20bn we will certainly have Solaris 10 and probably Windows XP64. Then we'll be looking for $30bn.
Just my 2c
Looking at the specfp scores it doesn't seem that the change from v7.1 to v8 made that much difference (see the SGI Altix lines where you can directly compare).
There are so few results for the itanic and, strangely, very few where you can break out the effect of cache versus clock.
Looks to me like the weasel is firmly in intel's camp.
Changes to k8 fp execution unit with E0 stepping
Sorry not to get back to you on this but been busy.
I did find this summary of what to expect as SSE3 arrives in the core:
"Kevin McGrath, chief architect of the AMD "Hammer" line, recently gave a presentation at Stanford University detailing the forthcoming changes in the next revision of the Athlon 64 and Opteron. Apparently both processor lines will feature full compatibility with SSE3. In fact, it may actually be somewhat better than Intel's SSE3, as the AMD chip will dynamically translate some of the SSE instructions into operations specifically tailored to the "Hammer" design, in some cases lowering latency down to as little as one cycle. Intel's latest "Prescott" iteration of the Pentium 4 design requires many more cycles to complete the same work due to its higher clock speed and deeper pipelines."
http://www.geek.com/news/geeknews/2004Mar/bch20040303024101.htm
Looks to me like a significant revamp of the fp unit.
Cache usage on specfp
Uh, your link points to memory, not cache usage. The executing portion of the program << total program size.
See how the itanic2 scales with cache (from spec site):
Dell PowerEdge 3250 (1.4GHz/1.5MB, Itanium2) | 256KB L2 (I+D) on chip | 1.5MB L3 (I+D) on chip | base/peak 1444/1444 | Sep-2003
Dell PowerEdge 3250 (1.4GHz/3MB, Itanium2) | 256KB L2 (I+D) on chip | 3MB L3 (I+D) on chip | base/peak 1868/1868 | Apr-2004
Dell PowerEdge 3250 (1.5GHz/6MB, Itanium2) | 256KB L2 (I+D) on chip | 6MB L3 (I+D) on chip | base/peak 1875/1875 | Aug-2003
So at the same clock (1.4GHz) the fp score goes up > 400points as we double the cache from 1.5MB to 3MB!
Now increase the clock to 1.5GHz and double the cache to 6MB and the score goes up 7 (yes that's SEVEN) points.
Duh, doesn't that look like the score is highly cache-sensitive, with a used cache size somewhere between 1.5 and 3MB? Once the app is bigger than the cache I suggest that the 1444 score is the accurate figure (add another seven on for the clock speed, say 1451).
At 1451 it's a bit inferior to the Athlon64s, as you no doubt realize. Typical results for the Opteron 150 run 1528-1644 with a 1MB cache.
Itanic2 blade
Going to be some sorry customers if they don't do their homework:
http://story.news.yahoo.com/news?tmpl=story&u=/nf/20041109/bs_nf/28270
Itanic2 does well on specfp because the spec routines fit in the on-chip cache. Most real numeric tasks are not in-cache and those specfp results don't pan out in the real world.
intel has hardly made a secret of its give-aways on itanic.
Maybe it's time you faced the obvious.
Still, I'm happy to see intel carrying on pushing the itanic for all it's worth.
If you want a different response there's always the intel board where you could post. I'm sure you will get a more favorable audience there.
IPF
You don't think that intel has taken to seeding that market with free itanics? It's a highly visible list. I wouldn't bank on breakthrough itanic sales on the basis of this if I were you. Once intel starts actually charging for the chips I doubt we'll see many new entries.
Meanwhile Cray & Sun are just starting to put out their Opteron products.
We'll see!
Maintaining specific code
That's why I prefer a macro assembler! The relevant bits can be extracted into included files.
You make an important point about pentiums. The only thing they have in common is the branding. I have a Pentium III here in a laptop, and I could not use that to develop Willamette or Northwood code.
While it makes sense for corporations to be "all p3" or "all p4" there is no logical basis for being "all intel" except for the "intel inside" sticker. There was a basis for using intel chipsets instead of via's, but that has passed now that we have HT and nForce.
Vendor specific code
That's the beauty of plug-ins. You can let the user specify what processor they have. In fact my code checks the CPUID and determines which are valid choices so a k6 user can't specify SSE for example.
I did 3 versions from one codebase:
3dnow/MMX for k6
3dnow/MMX enhanced for athlon
3dnow/SSE/MMX for athlon xp
MMX for intel chips (less accurate)
Someone else did an x87 version that was as accurate as my 3dnow but much, much slower.
This was highly time critical code. An early version became part of DivX but then they went proprietary. I don't know what they use there now.
I liked 3DNow except for one thing: you couldn't specify the rounding mode and it was not defined as part of the standard.
"Maybe that wouldn't be practical."
No, I don't think it would be. The main issue is that, AFAIK, MMX is not supported for 64-bit progs. So I have to convert MMX to SSE.
A lot of the code is macro driven so, with luck, that's write-once etc. It was very tight code - all I could do to avoid register spillage. Now that there are 16 regs that should be less of an issue.
I also have a section that is 3DNow; now that will be interesting. This code (it's freeware, distributed as part of a major freeware package) is run a lot and on large video datasets. So how many people knew that 3DNow ended up being so widely used?
I think you're mixing up two threads that I was commenting on simultaneously.
The E0 change in the fp unit has nothing to do with reliability.
I pointed out my own experience with early VIA Athlon chipsets as the reason, or maybe a reason, why intel did well with the P4 when Win2000 was being adopted. It was certainly sobering for me, and I reported that the virtualdub author had knowledge of similar experiences. He is now using an Athlon64. Virtualdub is very important to video enthusiasts.
Any improvement in the speed of SSE processing will be very welcome in video - this from my own profiling of video-related code. It looks like the Athlon64 will become the preferred platform for video, especially as 64-bit apps arrive. I see video as the future of home computing.
Sorry that my comments on two unrelated matters caused confusion. Now, when will I get time to port my code (assembler) to AMD64?