Wbmw:
You can have all the systems available to order, but if the customers don't want them, they aren't sold. Very few servers are sold prior to extensive testing and trials. You'd have to be on hallucinatory drugs like Angel Dust to hope for large orders of untested and unvetted servers. It took Opteron a year, even when sold by OEMs, to get significant orders. Woodcrest will take a similar time, except if it has known real problems. Then it can take far longer, if ever.
Just because HP "tested" them doesn't mean they did it in your environment, using your specific software, doing the things you need done your way. Look at the GAO experience. You think IBM didn't test it before giving it to GAO for their testing? Look at Intel with the P3-1.13GHz. They ran it through all their testing and even launched it. They never tried simply doing a Linux kernel compile. They pulled it because they could reproduce the problem. And how about the infamous FDIV bug?
These past examples are a very good reason to try any server in your particular environment for a long period before buying it in large quantities. Most IT managers have heard horror stories, or have been burned themselves, when not following that rule. It's hard enough to do their jobs when the equipment is rock solid under whatever is thrown at it, let alone while dealing with equipment problems.
Contrary to your recent belief that merely having a benchmark king will cause an immediate switch in purchasing, especially among very conservative server buyers, the real world doesn't work that way. This helped the old Xeons keep large market share when Opteron came out, and now it will keep Opterons being purchased in vast quantities while Woodcrest is being tested. Any problems during that testing and Woodcrest gets written off as too unstable. Even with no problems of their own, persistent rumors from others can write Woodcrest off for a company.
That was the reason for the long Opteron testing period prior to launch, and the likely reason the CPUs using Socket S1 (Turion & X2) and AM2 (Sempron, A64, X2 & FX) are already out while Socket F Opterons have yet to launch. They are making sure that even hints of problems are gone by the time it launches. Given the above, it's the prudent thing to do.
As for 8 sockets and up, Opteron is more likely to defeat Itanium than Power. With the red ink looming at Intel, Itanium is not long for this world. And the FSB is going to be more of an anvil around Xeon's neck as core count goes up, irrespective of sockets. More sockets require more FSBs, or else scores become absolutely dismal with the current dual FSB, especially as core counts rise. More FSBs mean far larger pad counts on the chipset dies, making them very large dies with reduced production and yields.
Server consolidation will favor the larger-way (socket) servers. It is far better to replace 20 old dual Xeons with a single 4-socket server than with multiple dual-socket servers (needed because of the small amount of memory and storage in the latter versus the former). Look at the Sun x4600 and x4500 for good examples compared to the dual-socket Woodcrest servers. The Suns have more memory, disk and I/O than those Woodcrest HP boxes. And most servers are I/O bottlenecked rather than CPU limited.
As for the price cuts, AMD is waterfalling CPU bins in their price list soon as well. If Intel sells its sweet-spot P4s and Celerons at 1/3 their current price, they will slash their ASPs and revenue by 1/2 to 2/3 as well. Intel bleeds dry quickly at 2-2.5 billion a quarter in CPU revenue at $50 to $75 ASPs. AMD is moving up the sweet spots in its various lines, putting them closer to the real distributions seen in production, which will further stabilize their ASPs.
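If you want to see the ASP math, here's a quick back-of-envelope in Python. The unit count, volume mix and prices are made-up illustrations, not Intel's actual numbers:

```python
# Back-of-envelope: what a 1/3 price cut on the volume sweet spot does to revenue.
# All numbers are illustrative, not Intel's actual mix or prices.

units_per_quarter = 30e6          # hypothetical CPU units per quarter
old_asp = 100.0                   # hypothetical blended ASP, dollars

# If the sweet-spot bins (say 80% of volume) drop to 1/3 of their old price:
sweet_spot_share = 0.80
new_asp = old_asp * (sweet_spot_share * (1 / 3) + (1 - sweet_spot_share) * 1.0)

old_rev = units_per_quarter * old_asp
new_rev = units_per_quarter * new_asp
print(f"ASP: ${old_asp:.0f} -> ${new_asp:.2f}")
print(f"Revenue: ${old_rev/1e9:.1f}B -> ${new_rev/1e9:.1f}B "
      f"({(1 - new_rev/old_rev)*100:.0f}% decline)")
```

With that assumed mix, the blended ASP falls to about $47 and revenue drops roughly in half, which is the 1/2 to 2/3 range above.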
They were selling an A64 3500+ capable CPU as a 3000+, and a 3800+ capable CPU as a 3200+. Now they will get the same prices for the same CPUs, but the public gets two guaranteed speed grades' worth for the same money. Ditto for the X2s. The larger-cache versions' higher prices weren't justified given the lower KGD per processed wafer. They make more money per wafer area with the lower-cache versions.
Intel OTOH isn't down-binning much, even at 65nm. They got lower power, but not higher speed at worst-case conditions, so they have too many CPUs in the lower bins. Yes, instead of the specified 85W TDP cooler, people putting on monster 200W TDP coolers can get higher speeds, but Intel doesn't want the bad publicity of specifying the 200W TDP cooler, lower ambient temperatures and higher voltages (and living with the higher failure rate and lower stability) normally accepted in overclocking. And given the tighter specs, people would only be willing to pay so much less for those parts (they couldn't overclock them much, if at all), so the outcome is the same anyway.
No, they have to dump them at firesale prices, and they have to keep making them, since Yonah, Dothan and MCW don't generate the revenue Intel needs to survive. If they were willing to take the loss of face and of monopoly status, they could simply stop making P4s and Celerons and use the resulting shortage to make high profits on their remaining P4 and Celeron inventory. Yes, that would also have AMD making scads of money, but they could form a duopoly and both could make huge profits. But Intel likes being a monopoly too much. They are willing to slit the stockholders' throats to keep it. Only when it becomes their own throat will they start to see reason.
Prepare for your very rude awakening when Intel reports. Do try to sift through their report with a fine-tooth comb. If it looks too good to be true, it probably is. Intel's Q3 guidance will open your eyes and drop your jaw. You'll have to pick it up off the ground afterward.
Pete
Wbmw:
Where have you been? Head stuck in the sand? Opteron is on a tested, tried and true platform. Woodcrest just started the needed testing; it'll be done in 12 months. Until then, only samples will be bought. So far, Woodcrest has failed to get GAO certification. That typically means going back to the drawing board. So the only Xeon servers that went through that rigorous testing are having their chronometers cleaned by Opteron. Only IBM's Hurricane glue gives them any chance, and when similar glue is given to Opteron, even Hurricane-boosted Xeons get creamed.
And Woodcrest is only good for 2S. It just doesn't scale. Opteron is good up to 8S and soon will go to 32S glueless. With glue, even larger socket counts aren't out of the question. So Opteron has the high ground in AMD64 wrt Woodcrest.
As to the benchmarks, here is what Intel would like you to use, price/performance (lower is better): http://www23.tomshardware.com/cpu.html?modelx=33&model1=262&chart=118&model2=214
The Sempron 3400+, at the time of testing, gets 11.22, while the P4 550 gets 31.34, almost three times as much, on a 50/50 games/applications mix heavily weighted towards encode/decode and the games the P4 does better in. The highest available Celeron, the 3.06GHz/256KB Celeron D 346, is much slower than a P4 550, justifying its lower price. Of course the best one on that list is the Sempron 2600+ at 7.60. The best 939 A64 is the 3000+ Venice (1.8GHz, 512KB) at 10.71. Another thing to notice: while a Thunderbird and some AXPs were tested, no Celerons appear on that list. That must tell you how undesirable they are.
And you have nothing on the total volume Intel will sell in Q3 and Q4. 10-15% of that will not be 4-6 million, but considerably less. Meanwhile, K8 DC is already above 15% of AMD's production. What NGA can't afford is needing another spin (B-3). A three-month delay would about do Intel management in (Otellini's head would roll).
Pete
Wbmw:
The Sempron 3400+ is more than a match for the Celeron D 356. It's quite a bit faster than a Pentium 4 3.4GHz even on Socket 754: http://www23.tomshardware.com/cpu.html?modelx=33&model1=262&chart=68&model2=214
The Socket AM2 version is even faster. And oh, by the way, Pricewatch has no ads for the Celeron D 356, while there are plenty of Sempron 3400+ listings in both 754 and AM2 flavors. So if you can only get Celeron D 346s (and slower), it doesn't matter what Intel prices the faster version at. The sweet-spot Celeron D gets killed by the sweet-spot Sempron. The sweet-spot Sempron may be faster than most of the P4s, including their sweet spot. And that's why most of the Celeron and P4 production has to be sold at prices well below that of the average Sempron.
And that's why the pecking order is Opteron->Woodcrest->Conroe->A64->Sempron->P4->Celeron. And with Woodcrest and Conroe in tiny volumes (even less than Opteron), Intel's ASPs will fall even in Q3 and Q4.
Pete
Chipguy:
You made sweeping statements without any constraints or caveats. Then you state that you assumed people would apply some normal caveats. But you ignore other normally applied caveats. So the other debaters must psychically figure out which caveats you used, which you dismissed and which you missed. Unless you specifically state which constraints and caveats you are using, people will assume you used none at all. Leaving them implicit simply gives those who use them too much wiggle room.
How can any designer miss that processes change over time? How can any specifier of chips miss that worst case means no temperature change? How can anyone not realize that you must compare competitors on the same basis? There are so many things you don't seem able, or willing, to grasp.
Given the above, you do not see the big picture. You might see a very small piece of that big picture, but can't extrapolate that little piece back into the big picture. Since many of the big picture's aspects seem to slide right past your grasp, you are ignorant. Given your attitude, you plan to stay that way.
Pete
Chipguy:
With your holier-than-thou posts, you consistently make statements that are not backed up by facts. When I call you on them, you change the argument. You are the one who stated that the Vdd voltage is set to the minimum. Now you claim that it is set to the minimum plus a guard band.
You state that subthreshold leakage goes up exponentially with temperature, forgetting the facts: one, a 30W TDP rise causes the temperature on a good HSF to go up by only 6-8K. That increase is minimal wrt a 342K worst-case case temperature. When you plug it into your equation with typical K4 and K5 constants, the results change by very small amounts compared to the endpoint values. And since the worst-case temperature doesn't change, it causes zero change wrt AMD's TDP.
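To put rough numbers on how small that correction is, here's a toy calculation. The e-folding constant and the leakage fraction are illustrative assumptions, not AMD's actual K4/K5 values:

```python
import math

# Toy subthreshold-leakage sensitivity check (illustrative constants only).
# Model: I_leak(T) = I0 * exp((T - T0) / Tc).

Tc = 30.0            # assumed e-folding constant in kelvin (illustrative)
T0 = 342.0           # worst-case case temperature, kelvin (~69C)
dT = 7.0             # ~6-8K rise on a good HSF for a ~30W TDP step

leak_factor = math.exp(dT / Tc)
print(f"Leakage factor for a {dT:.0f}K rise: {leak_factor:.2f}x")

# If leakage were, say, 25% of a 95W total, the whole-chip change is modest:
leak_w, total_w = 0.25 * 95, 95.0
print(f"Total power: {total_w:.0f}W -> {total_w - leak_w + leak_w*leak_factor:.1f}W")
```

With those assumed constants, a 7K rise multiplies leakage by about 1.26x, which moves a 95W total to roughly 101W, and at the fixed worst-case temperature it moves the TDP rating not at all.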
You also imply that AMD's processes do not improve over time, ignoring the part of AMD designers' and engineers' arguments, with facts to back them up, that transistors within a process speed up over time. You claim a process that produced a 2.6GHz DC Opteron at 95W worst-case TDP in Q4 of last year can't be sped up at all by Q4 of this year. Others say it's quite possible, with the four transistor improvements planned this year, to get the 30% speed increase to 3.4GHz, some even allowing the worst-case TDP to rise to 125W to get there.
No, in your infinite wisdom, AMD just can't improve their process at all. Neither can IBM. But Intel can improve theirs by leaps and bounds. Yet Itanium didn't improve more than 5-7% in speed over three years. P4 didn't improve more than 15% in speed over two years, and that included two process shrinks. P-M didn't improve more than 13% in speed over two years, also including two process shrinks. Perhaps it is Intel that can't improve much, and you want to believe that AMD and IBM are stuck in the same position. Perhaps their different systems and philosophies allow them to do much better. Their history so far has proven you wrong on many counts.
I didn't miss the fact of a guard band; you forgot to state it. You forgot that temperature isn't a factor when using AMD's worst-case TDPs. You forgot that processes improve over time. Most with a clue remember those things. Part of knowing is knowing what must not be included, what can be ignored and what must be included. That is the knowledge you lack in these discussions, and it is more crucial than knowing facts and figures.
Thus you may know something about chip design, but can't apply that knowledge to other aspects of the business. Many of your on high pronouncements in the past have proven incorrect either because you ignored crucial information or placed a constraint that was illusionary.
Yes, I do speculate. You have to, because so much of the needed information is not available publicly. And because that requires stating the rationale behind the speculation, the posts do get long. The problem is that when I don't include it, people complain and ask how I got there. So I err on the side of saying more.
Pete
Chipguy:
First you said that "the processor's operating voltage is set to the minimum for proper operation". That was wrong according to that shmoo plot. Voltage is set 16 to 22% above the minimum operating voltage (1.35V wrt 1.15-1.10V). Second, you claimed that to go higher, Vdd would have to be raised. Yet to go 7% higher, to 2.8GHz (given the percentages from the real numbers that would have been used), Vdd could be dropped to 1.3V.
Lastly, that data was taken well before 2/17/2006, likely in Q4 of 2005. There have been 2-3 transistor upgrades since then. 7% gets it to 2.8GHz and 14% gets it to 3GHz. Since SiGe is supposed to net a 30% increase in speed, that pushes it to 3.4GHz at the same power. And that's before any increases due to the transition to 65nm.
So WTF do you know? This old shmoo plot has already proven you wrong twice.
Pete
Wbmw:
Temperatures are not measured in Celsius or Fahrenheit for this purpose, but in Kelvin or Rankine. So it's not a 20% increase going from 50C to 60C, but only a 3% rise going from 323K to 333K. A 100W TDP rise on a good cooler is only a 20K rise. So the 60W TDP rise we were talking about translates to a 12K rise, or about 4%. Even if you cube it, that's only a 12.5% rise for a 30% clock boost.
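The same arithmetic in a few lines, assuming roughly 0.2 K/W for a good cooler (the 100W -> 20K figure above); it lands on the ~3%, ~4% and ~12% numbers:

```python
# The post's point: temperature ratios must be taken on an absolute scale.
c_to_k = lambda c: c + 273.15

t1, t2 = c_to_k(50), c_to_k(60)
print(f"50C -> 60C is a {(t2/t1 - 1)*100:.1f}% rise in kelvin")   # ~3%

# A 60W TDP step on a good cooler (~0.2 K/W assumed) is ~12K:
t3 = t1 + 60 * 0.2
ratio = t3 / t1
print(f"12K rise: {(ratio - 1)*100:.1f}%; cubed: {(ratio**3 - 1)*100:.1f}%")
```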
You also forget that AMD is using a thicker gate oxide than Intel, and has at each process generation so far, and thickness has an exponential effect. A 10% greater oxide thickness translates, at the same voltage and temperature, into a 0.5 to 1 order of magnitude decrease in gate leakage. Thus AMD tends to have a higher percentage of dynamic power in its total dissipation than Intel.
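A rough feel for that sensitivity, assuming a ballpark tunneling slope of about one decade of gate leakage per 0.12nm of extra oxide at a ~1.2nm SiO2 baseline (an illustrative figure, not AMD's or Intel's actual process numbers):

```python
# Gate tunneling leakage falls roughly exponentially with oxide thickness.
# Assumed slope: ~1 decade of leakage per 0.12nm of extra oxide (illustrative).

t_ox = 1.2            # nm, assumed baseline oxide thickness
decade_per_nm = 1 / 0.12
extra = 0.10 * t_ox   # a 10% thickness increase
decades = extra * decade_per_nm
print(f"+{extra:.2f}nm oxide -> ~{decades:.1f} decade(s) lower gate leakage")
```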
Lastly, since AMD uses worst case for its TDP ratings, temperature doesn't vary at maximum. This removes temperature effects as far as AMD TDP ratings go.
Pete
Chipguy:
Voltage is not set to the minimum needed to operate reliably. My A64 3500+, from the 69W TDP family with a specific 50W TDP rating, is set for 1.35V. It runs at 2.2GHz at 1.15V, and that's an upper bound on its minimum since I did not try lower. It's running at 2.43GHz right now (a 10% overclock), rock-solid stable at default voltage. I haven't tried higher so far.
So your 10% over stock for a 10% frequency increase is just baloney. These simple empirical tests refute your theories. Experiment always trumps theory.
The other problem you have is that you fail to take into account that engineering samples built many months ago do not show what current silicon does. Intel sticks with the same exact process over many quarters (copy exact and all that entails). AMD uses APM and can make many adjustments over time, knowing which machines do the best job of making high-speed, low-leakage or high-yield parts. Newer product seems to do far better on average than older product.
So an FX-62 made in January will be 2-3 improvements behind one made in July. With each improvement getting 5-10% more clock, July FX-62s might get anywhere from 10% to 30% more clock; a 20% boost on 2.8GHz is 3.36GHz. Lastly, A64 FX-64s might already be being picked out for introduction later this quarter, and might all have the faster top speeds within.
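Compounding those assumed 5-10% steps shows where the 10-30% range comes from:

```python
# Compounding the post's assumption: 2-3 transistor tweaks at 5-10% clock each.
base = 2.8  # GHz, an FX-62 at launch
for n in (2, 3):
    for pct in (0.05, 0.10):
        gain = (1 + pct) ** n - 1
        print(f"{n} tweaks at {pct:.0%} each -> {base * (1 + gain):.2f} GHz (+{gain:.0%})")
```

That prints a range of about 3.09GHz (+10%) to 3.73GHz (+33%), bracketing the 3.36GHz figure above.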
Pete
Dear Combjelly:
A good test would be whether Dempsey has the same problems as Woodcrest. They both fit on the same MB. If Dempsey passes and Woodcrest doesn't, it may be a problem with NGA. Then one of three things is the likely culprit: the L2 cache arbiter, the prefetchers or the FSB.
RAID controllers generally use the biggest blocks used in transfers, 4KB and up. Data first goes to the cache on the RAID controller and then gets DMA'ed to memory. That traffic goes over the FSB to be cached (although device drivers should tell the CPU that that area shouldn't be cached). Now if the FSB is at fault, the massive data block (larger than even the ~1500-byte TCP/IP maximum over Ethernet) might create enough noise to have some data misread every now and then. Dempsey with its 1066MHz FSB might be clean enough at the lower speed not to have the occasional misread. Thus it doesn't have the problem.
Two, virtual memory swaps pages in and out of the RAID HDs. What if the prefetchers fetch data that will be replaced, and it then gets acted on before the correct data is swapped in? This kind of race condition would be hard to find, and it will only be seen when the system is under high load without enough memory to hold all the running programs. Perhaps it's the very aggressive nature of the prefetchers that gets NGA into trouble.
Three, the cache arbiter may let core one use what core two cached in L2 while core two is in the process of invalidating that cache line. Granted, that is a small time window, but the problem doesn't happen with either exclusive caches or independent ones.
In any case, NGA just shot itself in the foot. If this is the case, Intel just pulled a bonehead maneuver on par with the Linux-compile bug of the P3-1.13. This time they compounded it by dissing their P4s, causing them to be unwanted. Oh the irony!
Pete
Wbmw:
In that area, you don't know. Your protestations to the contrary just emphasize that. You'd be better off dropping the insults and quips and simply listening to others who do know.
Last post on this subject.
Pete
Chipguy:
No, you are another one who shows his ignorance with quick quips. Your non-Intel ignorance knows no bounds.
Pete
Tecate:
From Sunday's ads, you're more wrong, both in phrasing and in fact.
Pete
Wbmw:
Well you just opened your mouth and showed how ignorant you are.
Pete
Tecate:
HP has stated that they make more margin from their AMD lines than their Intel ones. Given that, they are more likely to buy even more AMD as it becomes available.
And this may not change even if P4s are priced really low, if the computers they go in lose more value than what HP saves on the CPUs. It's a loss to HP if they pay $60 less for the CPU but the computer it goes in loses $150 of its price. Even if Intel gave the CPUs away for free, HP would still lose money on the "deal".
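In other words, using the illustrative numbers above:

```python
# The "deal" math: CPU savings vs system price erosion (illustrative numbers).
cpu_savings, system_price_loss = 60, 150
print(f"Net to HP per box: {cpu_savings - system_price_loss:+d} dollars")  # -90
```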
Pete
Dear Greg:
Here is a bunch of material on antitrust and anticompetitive cases. In one case, 77% market share was found to be a monopoly against a competitor with 13%. Intel has had numbers quite a bit higher in the periods the lawsuit covers. And yes, that 13% competitor was awarded $1.05 billion in trebled damages.
It is also interesting what the penalties on both companies and individuals are. The US is known for throwing the individuals responsible in prison.
http://www.oecd.org/dataoecd/11/53/34427452.pdf
Pete
Wbmw:
I don't respond to people who obviously do not know what they are talking about. And that "nonsense" is yet another thing in which Intel will be forced to follow AMD's lead.
Pete
PS: Here is a recent reference for your edification: http://www.pureoverclock.com/article37-2.html
Unfortunately for you, there are many more.
Wbmw:
Obviously you are saying that AMD itself is lying when asked about this. You obviously think you know more than the designers at AMD, with their intimate knowledge of the way things are done within AMD die designs. They have stated that they use a cache design that lets them scale the cache size on an as-needed basis. So with the same basic die design they could make nx1MB L2 dies for servers, nx512KB L2 dies for desktops and mobiles, and nx256KB L2 dies for value desktops and notebooks. All would still be 16-way set associative, with a different number of sets in each size.
As to the I/O ring on the outside, there are two areas that look to be straight sections of buses only, which can be shrunk or expanded as needed. Thus the different cache sizes can be accommodated with little change to the basic die masks. Besides, while Intel uses "custom" routing to save space, which also requires much more engineering labor for die alterations, AMD uses reusable modular cells and mostly automated routing (some critical areas have manual assist, IIRC). This is why they can do more complex designs with less engineering labor, and another reason why they get more done with fewer employees. That's a hallmark of a lean and mean company.
Yes, it costs them some die area, but the gains in productivity, time to market and versatility more than make up for it. Adding third-party modules for specialized versions of the CPU can be accommodated far more easily than at Intel. Thus Torrenza becomes yet another advantage for AMD. If there is a niche market that burns a lot of cycles in some fixed algorithms, AMD can add a module that specifically implements that class of algorithms for a large performance gain. Normal CPUs would just do this in software but, for those who would pay $$$ to speed it up greatly, it gets done in the coprocessor.
Most of the time what really happens is that a company that specializes in the area makes an external HTX-connected coprocessor to do it. If there were high demand, as there was for GPUs, sound DSPs or the really well known one, FPUs, then the coprocessor module would be placed onto the die and connected to the XBAR. Later, if demand were really high, it could be moved into the CPU core itself for some additional performance gains, like the FPU or MMX. So HTX lets consumers vote on which coprocessors make sense for them. Winners get onto the die. Big winners get integrated into the core. Thus the real world becomes the benchmark for what gets into the core, onto the die or merely externally connected.
So while Intel has to make one die size that fits all, AMD can make multiple die sizes that fit what its customers demand. IE: "Do it our way!" vs "Have it your way!"
Pete
Dear Spaarky:
AMD DC Opterons were available before their launch. They were simply plug in and go faster. The last time that happened for Intel? A long time ago, before Opteron showed up.
Since on a typical AMD launch parts are already selling, the Woodcrest launch is going to be a paper one, given that no parts have shown up. Mike's distributors are even saying they won't have parts until August at the earliest. That is just more proof that Woodcrest is being paper launched. We will see what volumes they ship in when it's available. Dell, normally the first to show, says they won't even have systems available until November. That's real slow availability.
Pete
Dear Tenchu:
AMD is growing markets. What about 50x15? Coprocessors? Third-party integration? Customer-requested features?
Which markets has Intel grown? Big fat zero! They have either failed or are failing in these other markets. Perhaps it's their "my way or the highway" mentality versus AMD's "let's go on together" and "I win and you win" mentalities. The former gets left behind and the latter gets ahead.
Pete
Wbmw:
Considering that going from one SuperPI task, which runs completely in cache, to two reduces NGA's performance by 15%, I consider that strong proof that the shared cache is adding 15% to performance in going from 2MB to 4MB. OTOH, DC K8 going from one SuperPI task to two shows little reduction in performance.
Another thing: Intel doesn't want to release any DC scores that hit main memory heavily. Perhaps the FSB can't supply both cores when the working set doesn't fit into the cache. Then the performance is mostly cache related, and NGA becomes a dog when the going gets hard (large working sets and/or multiple running tasks). Also, no 64-bit scores have been released. Could NGA be a dog in 64-bit too?
Like I have said before, these selective releases don't say much about NGA's performance in the real world. You'll be disappointed when NGA hits the public meat grinder. But as always, you'll ignore all the tests that prove you wrong.
Pete
Wbmw:
Why do anything of the sort? You'll just ignore it and then claim something favorable to NGA that the data just doesn't support. Then, after you are caught, you'll just make quips unrelated to the subject at hand.
If you really want to see differences between various AMD CPUs at varying cache amounts see this CPU comparison page from Tom's hardware:
http://www23.tomshardware.com/cpu.html?modelx=33&model1=245&chart=58&model2=240
Just look at the bars in red.
You can select which CPUs and which benchmark to compare. It's so simple even someone as challenged as you can do it.
Pete
Wbmw:
You do not know how much of the percentages for Conroe, Merom and Woodcrest can be attributed to cache. For K8, however, cache-size benefits are well known, as there are versions with cache totals of 256KB, 384KB, 640KB and 1,152KB, showing what cache size buys for a given workload. There are also many tests that show what latency buys, by using DDR with various memory timings. Furthermore, tests exist showing the benefits of doubling bandwidth on K8.
These show that the biggest benefits on desktop applications come from latency reductions, followed by cache size and then bandwidth, for single-core desktop and mobile systems. On server-type loads, latency reductions still give the biggest benefits, followed by bandwidth and then cache size.
Pulling memory traffic off the HTT link in single-socket systems makes comparisons with the FSB-based Intel parts far more lopsided. HTT only has to carry I/O traffic, which doesn't load down its bandwidth all that much, while the FSB must carry both memory and I/O traffic, a far greater load. This affects latency, because the FSB only goes one way at a time and prior transactions must complete before the next can start, while HTT can read and write simultaneously. K8's latency is almost unaffected in the single-socket scenario.
When multiple sockets are considered, NGA's FSB becomes a severe bottleneck and latency degradation accelerates. The tested 75ns is with the bus doing nothing but memory accesses, with no I/O from either core and no memory accesses from the other core delaying the FSB from processing the request. Real loads might push the latency much higher than what the synthetic tests indicate.
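A back-of-envelope bus budget shows why. The bus width is standard, but the FSB clock and per-socket demand figures here are assumptions for illustration, not measurements:

```python
# Rough bus-budget sketch: a shared FSB carries memory + I/O for all cores
# and is half-duplex; HTT carries only I/O and is full-duplex, with memory
# going straight to the on-die controller.

fsb_mhz, fsb_bytes = 1333, 8                # assumed Woodcrest-class FSB
fsb_bw = fsb_mhz * 1e6 * fsb_bytes / 1e9    # GB/s, shared both directions

mem_gbs, io_gbs = 8.5, 1.5                  # assumed per-socket demand, GB/s
print(f"FSB: {fsb_bw:.1f} GB/s shared; load = {(mem_gbs + io_gbs)/fsb_bw:.0%}")
print(f"HTT: carries only the {io_gbs} GB/s of I/O; memory bypasses it")
```

At that assumed demand, the FSB sits above 90% utilized, which is exactly the regime where queuing pushes latency well past the unloaded synthetic figure.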
I find it telling that practically no tests have been publicly released showing NGA's ability on heavy power-user or server-type workloads. Most of the multicore stuff seems to be single-code, multiple-data, where prefetching is simple and straightforward and the two cores don't interfere much with each other, or one is idle. And 64-bit testing seems to be completely missing from the public eye.
All this basically says that NGA needs a much wider round of public testing before we can foresee how well it performs vis-a-vis the well-known K8. We will see how it does when put through the public meat grinder.
As to your dissing of the IMC and HTT versus the FSB, you seem to miss many advantages of that combination; many come from the synergies of those two working in concert with much else. You also seem to forget that FSB turnarounds flush the bus pipeline, since the bus can't be turned around until all ongoing transactions are finished. That overhead seems to have slipped your mind.
Pete
Chipguy:
Intel puts ten boatloads of stupid compiler tricks into theirs. They even bury all the needed optimization switches directly in their compiler. I prefer the openness of a more widely applicable compiler suite.
So a black kettle calling a copper pan black is just sour grapes.
Pete
Do you know how Intel does "Instant On"? They shut off the video output and mute the sound output. That's it. When turned "on", it merely turns the video back on and unmutes the audio. It doesn't even slow down the CPU or put it into some sleep state prior to turning it "off".
Done that way, all modern computers can be set to do Intel's "Instant On". Set power management to turn the monitor and sound off after x minutes. Wait the indicated idle time and voila! The screen goes off and the sound mutes. Now move the mouse or tap a shift key and WOW, the system "Instant Ons"!
Just more marketing spin instead of the real deal. Typical Intel garbage.
Pete
Dear Jhalada:
K8 has implemented micro-op fusion since day one. Alan was talking about macro-op fusion (which mostly shows up as x86 instruction fusion).
Pete
Wbmw:
You had better take another look! The code sequencer is part of pick, not decode. It is, however, implemented as a 4-way massively parallel predecoder. What part of pre-decode don't you get? That's the stage where they find the beginnings of instructions, and they can get 3 starting points for instructions every cycle, assuming all three fit into a 16-byte fetch.
Here is the overall block diagram: http://www.chip-architect.com/news/Opteron_MPF_12.jpg
See the three-way decoder. The stage you looked at was the pick stage, which is shown as all-in-one with pick. Even Tim Wilkins (AMD software optimization group), when writing the Opteron Optimization Guide, stated that at least 2 vector-path instructions can be decoded per cycle. And the ordering is fully independent; complex instructions can be anywhere in any given 16-byte fetch. FSAVE is one such instruction.
As for Woodcrest samples, how do you know what clock speed they run at? As for 3+GHz K8s, Opteron x56s are being introduced later this month. Those are 3GHz K8 SCs, FYI. Socket F Opterons in April: more 3GHz K8s. 2.8GHz Socket F Opteron x90s mean that 3.2GHz SC Socket F Opterons will be there for applications needing massive bandwidth or memory sizes. Those will be given x58 model numbers. They may not filter down to the desktop lines.
Keep dreaming, Wbmw. Your BMW has turned into a go-kart. It's all you can afford.
Pete
And the so-called tank will get blown up by my Antitank Marchitecture Destroyer. It's a DU round penetrating the turret and then bouncing around inside, shredding the occupants. Then you run screaming away, claiming Intel is great, before you are cut to ribbons. The truth hurts, doesn't it?
Pete
Wbmw:
Re: NGMA still has parts of P3 in it. The 411 decoder, now 4111, still can't decode more than one complex instruction per cycle. And if a simple instruction comes first, it can't decode even 1 complex instruction in that cycle. K8 can do three in any given cycle, which makes it an all-around performer on widely varying code.
Wrong again, Pete. AMD essentially has 3 simple decoders. Any "complex" instructions go through the vector path and the micro-code sequencer comes up with a uop equivalent, albeit at a cost to performance. You don't even know how this works, do you?
Wrong again, Wbmw! Check out this image from Hans de Vries: http://www.chip-architect.com/news/Opteron_1600x1200.jpg
And look over this: http://chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
Three decoders, each of which can decode a vector-path instruction. The trouble is that the next stage can handle only three uop pairs. So moving some instructions from complex to double or direct path increases throughput. NGMA only allows 4 uops. K8 has 3 uop pairs, each with an ALU/FPU and an AGU operation. In the example I gave, 1 simple instruction followed by a complex instruction, AMD can do it in one cycle, but NGMA takes two. Two complex instructions that decode to an ALU/AGU uop pair followed by an ALU/NOP uop, plus a NOP/AGU uop pair followed by an ALU/AGU uop pair, can be fused into three ALU/AGU uop pairs; NGMA still takes two cycles to do that decode to K8's one. So all in all, NGMA's decoder can be slower in many cases, a little faster in a tiny fraction of cases, and likely slower overall.
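Here's a toy cycle-count model of that argument, under this thread's simplifying assumptions (only the first slot of a 4-1-1-1-style decoder takes a complex instruction; any of K8's three slots can). It's a sketch of the argument, not either chip's real pipeline:

```python
# Toy decode model under the post's assumptions (not Intel's or AMD's actual
# hardware): the "4-1-1-1" decoder has 4 slots but only slot 0 takes a complex
# op; the "3-wide" decoder has 3 slots, any of which takes a complex op.

def cycles(stream, slots, complex_slots):
    c = i = 0
    while i < len(stream):
        c += 1
        for slot in range(slots):
            if i >= len(stream):
                break
            if stream[i] == 'C' and slot not in complex_slots:
                break                     # complex op must wait for next cycle
            i += 1
    return c

stream = ['S', 'C'] * 8                   # alternating simple/complex x86 ops
print("4-1-1-1 style:", cycles(stream, 4, {0}), "cycles")
print("3-wide, any-slot complex:", cycles(stream, 3, {0, 1, 2}), "cycles")
```

On that alternating stream it prints 9 cycles for the 4-1-1-1 style versus 6 for the 3-wide any-slot decoder, which is the pattern described above.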
We still don't have the level of detail for NGMA that we have for K8. The block diagram may give clues as to how they did it, but the devil is in the details. NGMA can do 4-wide decoding, but only if the first instruction (complex or simple) is followed by three simple ones. K8 can decode three complex instructions, but can only send three uop pairs per cycle to scheduling. Of course, we don't have details for K8L, or even what tweaks have been added to K8F.
The proof is, of course, actual testing without the restrictions of NDAs, plus the quantities seen at retail. Having a few Conroes at 2.66GHz does nothing if the bulk are at 2GHz. You can always cherry-pick one (or a few) for benchmark fests. But if Intel can only make hundreds at 2.66GHz while AMD can make hundreds of thousands of 3+GHz K8Fs, Intel will not have the performance crown no matter what paid reviewers say.
Pete
Wbmw:
Wow, for someone who prides themselves on intellectual honesty (the first words out of his mouth), this guy is anything but.
His entire response makes it seem like NGMA is a K8 clone, but without the integrated memory controller. He totally attributes energy efficiency as an AMD innovation (hello, TMTA?, not to mention Pentium M), and even goes as far as calling predictive branching, large buffers, and wide pipelining as K8 attributes that NGMA is copying (the point being that Banias launched with many of these things before K8 even hit the market).
You are being dishonest here. K7 had some of those things and came out before Banias, and there was the K6-3, which had an even better branch predictor than Banias, P4, Dothan and now NGMA.
The guy does not deny Intel's performance claims, but calls them pretentious; obviously, when you are AMD, you don't want to see your competitor wipe the floor with your offering, but the most that Henri Richard is willing to recognize is that Intel has caught up by copying AMD's design, copying their initiatives, but still lacks anything that can be called next generation.
Yeah, very generous (sour grapes) from AMD. I guess we'll see in a few months whether Intel's "K8 Clone / Quick Fix" product can outperform AMD's best response. If so, this guy is going to be eating a lot of crow.
Yet he has seen AMD's future products. You haven't. NGMA still has parts of P3 in it. The 411 decoder, now 4111, still can't decode more than one complex instruction per cycle. And if a simple instruction comes first, it can't decode even 1 complex instruction in that cycle. K8 can do three in any given cycle, which makes it an all-around performer on widely varying code.
And then AMD can do things not anticipated by you. In a few quarters, Conroe may turn out to be a YAWN: Yet Another Wanna-be, like Prescott, Tejas, RDRAM and the i820. Talk is cheap; it's meeting the hype that is hard. Intel has failed to meet the hype so many times in the past. And even when the design comes close to meeting the hype, they can't produce it in volume at the hyped speeds. Look at Core Duo, aka Yonah. According to Rahul, it's "impossible to get". This from a supposedly well-oiled fabbing giant. Who is to say they won't have production problems with Conroe too? You not only need a good design, you need to make large quantities of it. Failure at either leads to big problems.
AMD has proven that they can make large quantities of top-performance product, and they continue to do so. All I'm telling Intel is: show me. They haven't yet. You OTOH continue to be wrong!
Pete
Wbmw:
Intel must be worried. They didn't show how much power Conroe really drew. Could Conroe do a Prescott because it draws too much? Saying a CPU fits in an xxW TDP is nothing next to showing what it actually uses. We know how much an FX-60 uses from many third-party sites. Of course, they have 6 months to reduce the power to fit within the bin, but as has happened before, they could miss and have to slip the launch.
Pete
Wbmw:
Where was Intel's showing of processor power consumption in those PR-fluff so-called review pieces? If it would have shown Conroe in a better light, they would have included it. Since they didn't, perhaps Conroe was drawing more power than the overclocked FX-60. And would the perf-per-watt benchmarks have shown the FX-60 in a better light, even ahead of Conroe despite all its disadvantages? Not having them for Conroe makes such comparisons irrelevant. We know from Xbit and other sites that the FX-60 runs well below its published TDPmax.
Any truly independent third party would have at least attempted power-draw readings. You can tell where Conroe would be shown in a bad light by what Intel didn't say, do, show or allow anyone else to do. They didn't use a standard BIOS. They didn't use an off-the-shelf video driver. They did not publish the setup, hardware used, device settings, etc. They didn't use standard off-the-shelf software or demo scripts. So much was non-standard that any benchmarks would need extensive notes detailing the differences from the norms.
If the results can't be reproduced by independent third parties, the claims must be thrown out as irrelevant. Due diligence isn't an option but a requirement with hardware reviews. Too many review sites have lost sight of that.
Pete
BTW, I called it when the Hexus.net review's PCMark 2005 memory scores came in low for the FX-60 OC given the published settings, and I was subsequently proven correct. That is not "missing" from here. Likely just your typical reading comprehension problems.
Dear SmallPops:
From my post on SI:
As for Hexus.net, the PCMark memory score is telling. The FX-60 @ 2.8 got about 4500, which is about the same score (4445) as an FX-60 @ 2.6 got with 2-3-2-6/2T on an Asus A8N32-SLI MB. The FX-57 running at 2.8GHz got 4745. Thus the memory on the RD480 was not actually running at 2-2-2-5/1T, but more like 3-3-3-10/2T, probably due to the bad BIOS version. The CPU score looks to be OK (~5600) for a 2.8GHz FX-60, given that it runs completely in a tiny L1 and that a 2.6GHz FX-60 got 5218 on that same A8N32-SLI. So even Hexus.net didn't catch on that the memory wasn't running at the claimed settings.
The DFI RD480 BIOS trouble is real, and it affects memory performance, which is typically the same (within 1%) across all quality K8 MBs with the same CPU. Given that it affects memory scores by about 10% or so, the other bugs on the fix list can affect scores even more. Thus it invalidates most of the other tests.
Pete
PS Petz on SI found the following:
OK, you want proof?
Hexus.net, 1024x768 UT2004 BotMatch, "medium settings": FX-60, 160 fps
http://www.hexus.net/content/item.php?item=4843&page=3
Dual Video Cards: ATI 1900XT Crossfire
Memory 2x
Motherboard: DFI RD480 mainboard
CPU: FX-60, overclocked
Sharky Extreme, 1024x768 UT2004 BotMatch, "maximum detail graphics": FX-60, 214.4 fps
http://www.sharkyextreme.com/hardware/cpu/article.php/3261_3576616__7
Single Video Card: eVGA GeForce 7800 GTX KO 256MB PCIe
Memory: 2x512M
Motherboard: DFI NF4 Ultra-D
CPU: FX-60
WHOOPS, Intel!
Wbmw:
On Unreal 2004 at Tom's Hardware, an FX-57 running on an Asus A8N-SLI MB with a single 6800GT at standard clocks and standard drivers did 189.5FPS. Contrast that with twin X1900XTXs on a 975X Crossfire MB and a 2.66GHz Conroe doing 191FPS with specially coded drivers. Given that the FX-57, on a single previous-generation nVidia GPU, nearly matched that Conroe with dual top-end ATI flagship GPUs on an MB at least twice as expensive, the other benchmarks are equally suspect. A standard A64 X2 4800+ did 164.6FPS on the same 6800GT, 4FPS faster than that twin X1900XTX setup with a CPU clocked 400MHz higher. I'm sure that twin GeForce 7800GTX/512s on an optimized 32-SLI MB with an FX-60 @ 2.8GHz would do much more than 191FPS. And that's before Nforce5, AM2, dual DDR2-800 and Rev F.
Pete
Wbmw:
You use SPEC2K base when everyone knows ICC auto-figures the optimal flags for the SPEC subtests, so for ICC on Intel CPUs, base = peak. No other compiler has this behavior, including those from IBM, Sun, DEC, HP, Cray, MIPS, GNU, Pathscale, PGI and MS. That is why spec.org means peak when they say score. Intel and you are guilty of using special cases, rigged benchmarks and special circumstances. K8 doesn't need that. It uses Pathscale, GNU, PGI or Studio, and the scores across its CPUs are relatively consistent. And K8's scores on things like SPEC translate well into real-world use. Intel CPUs' benchmark scores do not.
I find it odd that SPECint2Krate and SPECfp2Krate were not cited wrt NGA. They are the multicore flavors of the SPEC CPU suite. Given Intel Marketing's past use of benchmarks, I find this telling: NGA will not do very well in normal benchmarks, especially the non-rigged, real-world ones. Games are typical for desktops, especially simulator types. The good ones really work a CPU over.
The only ones dancing and wriggling around are Intel and others like you.
Pete
Wbmw:
Wrong again, as usual. The best Windows Opteron score is from http://www.spec.org/osg/cpu2000/results/res2005q3/cpu2000-20050819-04517.html
1956, using Intel C++ 8.0 build 20040415Z for IA32 on Windows XP Pro SP2.
You also must use SPEC scores, not base, as peak is what spec.org implies by "score".
In SPECfp2000, Opteron gets 2344, not 2212: http://www.spec.org/osg/cpu2000/results/res2005q4/cpu2000-20050906-04675.html
And the Woodcrest scores may also be auto-parallelized in both SPECint2K and SPECfp2K, which makes the 2518 SPECfp2K score the one to compare against.
Rev F is to have at least a 10% improvement in SPECint2K and SPECfp2K clock-for-clock over Rev E. The Opteron in that SPECint2K result is a Rev D part, so it's likely to go even higher given ICC 9.1 (patched).
Besides, at some point this year SPECcpu2006 will come out, with much larger datasets and memory footprints that will no longer fit into the large caches present now. Almost all of SPECint2K fits into the L3 of Power5+ (128MB L3). Given that SPEC2K was sized to fit into the 256MB memory machines of its day, and that current WS/server users routinely go beyond 4GB, SPEC2K has become a "toy" benchmark (see spec.org's remarks on SPEC2K memory usage). A 2GB memory footprint is the probable new target for SPECint2006 and SPECfp2006, with 10 times the cache needs. Then adding cache won't be a big help.
Pete
Dear Keith:
Also the retail-box version of the Opteron 285 and both versions of the Opteron 885: http://castle.pricewatch.com/s/search.asp?s=opteron+885&srt=t&his=0&paging=1&i=3&...
I guess Sun will soon push Opteron x90SEs (2.8GHz DC).
Soon we should see Opteron x56s (3GHz SC).
Pete
Smooth2o:
To even get close to lean and mean, Intel would have to cut $3 billion from expenses each quarter and yet maintain output. At lower output, they would need to cut even more. Do you see the magnitude of the problem? It took a long time for that overhead to build up, and it's likely to take a long time to come down. If they do it too fast, they will lose the core of their R&D, engineering and process talent. Their first move should be to slice and dice upper management, followed by non-core activities. You know, the ones that go into that huge money drain labeled "Other". And of course, stop those pesky stock repurchases.
That's what needs to happen when you've neglected your cash cow and now it's sick and tired. The better policy is to take the short-term pain and let it get better, instead of trying to keep up output, which lowers the quality and thus the price. Soon it will die and leave you with a lot of bills. Starting a price war is stupid when your competitor is capacity limited with superior products.
Pete
Smooth2o:
Intel has not been lean and mean for a long time. They have lots of overhead. They will lose money in droves at $100 ASPs. AMD is making money with ASPs lower than $100.
Pete
Dear Rlweitz:
The sweet spot is more like 2.2-2.4GHz, as many of the lower-clocked CPUs can run at stock voltage and stock cooling at 2.6-2.8GHz. Back down the typical 10% guardband and you get 2.3-2.5GHz, which brings you back to the 2.2 and 2.4GHz bins. Semprons use a different die with 256KB L2 (Palermo). Athlon 64s use either a 512KB die (Venice) or a 1MB die (San Diego). The A64 FX uses San Diegos, or possibly a Toledo with one core deactivated. X2s use either Toledo (2x1MB L2) or Manchester (2x512KB L2). Rev F will use new names.
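The guardband arithmetic, for the record:

```python
# Sweet-spot arithmetic from above: stock-cooling headroom minus a 10% guardband.
for ghz in (2.6, 2.8):
    print(f"{ghz:.1f} GHz stable -> {ghz * 0.90:.2f} GHz after guardband")
```

That gives 2.34 and 2.52GHz, which rate out at the 2.2 and 2.4GHz bins.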
Another sign is that lower-clocked A64s have been in short supply, since AMD can sell most of what it makes into higher bins. Only recently have lower-clocked A64s begun to reappear, but that's likely Fab 36 output kicking in, making enough that highly clockable CPUs get down-binned into the lower bins. In a few months, Chartered will take over the fabbing of Semprons and Geodes, leaving Fab 30 to concentrate on A64s and SC Opterons. Fab 36 will likely concentrate on DC Opterons, X2s, FXs and any QCs that are needed.
Pete
Dear Chipdesigner:
Intel claims that a 3GHz Woodcrest will need 80W TDPtyp. Converting this to AMD-style TDPmax, given previous ratings on current Intel CPUs, gives about 110W. Two of these used for Cloverton would require 220W at 3GHz. I don't think Intel even wants to go there. Intel's 65W TDPtyp equals at least an AMD TDPmax of 87W, and that is supposed to top out at 2.67GHz. That still means a 2.67GHz Cloverton will be at least 175W TDPmax. Merom at 2.33GHz is 45W TDPtyp, or 60W TDPmax, so a 2.33GHz Cloverton would use at least 120W TDPmax.
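For reference, those TDPtyp-to-TDPmax conversions all use roughly the same factor. Here is that arithmetic, with the ~1.35x ratio as an assumption inferred from the examples above, not an Intel spec:

```python
# TDPtyp -> worst-case TDPmax using an assumed ~1.35x ratio, then doubled
# for a two-die Cloverton package.
for name, typ in [("3GHz Woodcrest", 80), ("2.67GHz Woodcrest", 65),
                  ("2.33GHz Merom", 45)]:
    tmax = typ * 1.35
    print(f"{name}: {typ}W typ -> ~{tmax:.0f}W max; x2 dies = ~{2*tmax:.0f}W")
```

That reproduces the ~110/220W, ~87/175W and ~60/120W figures above to within rounding.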
I did some figuring at Ace's showing that Taylor at 2.8GHz would use 27.65W going from the 1.8GHz DC Turion 64 at 1.075V (0.9V at 0.8GHz): http://www.aceshardware.com/forums/read_post.jsp?id=115154660&forumid=1
Extending that for QC Opteron, I get:
Dynamic power = 4*1W*(2.8/0.8)*(1.25V/0.9V)^2 = 27.00W
Static power = 2.5*3W*e^(0.35V/0.3V) = 24.08W
Total power = 24.08W + 27.00W = 51.08W
Granted, this may be from a cherry-picked part at the edge of working, but adding a little more voltage for stability, say 1.35V, I get:
Dynamic power = 4*1W*(2.8/0.8)*(1.35V/0.9V)^2 = 31.50W
Static power = 2.5*3W*e^(0.45V/0.3V) = 33.61W
Total power = 33.61W + 31.50W = 65.11W
Adding in 10% guard bands for voltage and frequency, I get:
Dynamic power = 4*1W*(2.8*1.1/0.8)*(1.35V*1.1/0.9V)^2 = 41.93W
Static power = 2.5*3W*e^((1.35V*1.1-0.9V)/0.3V) = 52.72W
Total power = 52.72W + 41.93W = 94.65W
So darn close to the 95W TDPmax for SC Opterons. Allowing for higher members of the QC Opteron family, 140W TDPmax sounds about right (it allows leakage to double).
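For anyone who wants to poke at these numbers, here's the same model in a few lines of Python. The constants are the ones from my Ace's estimate above, so treat them as assumptions, not AMD specs:

```python
import math

# The post's own scaling model, parameterized (see the Ace's link above).
# Baseline assumptions: ~1W/core dynamic at 0.8GHz/0.9V; ~3W static at 0.9V
# with an exp(dV/0.3V) leakage dependence and a 2.5x factor for four cores.

def qc_power(f_ghz, vdd, f_guard=1.0, v_guard=1.0):
    f, v = f_ghz * f_guard, vdd * v_guard
    dyn = 4 * 1.0 * (f / 0.8) * (v / 0.9) ** 2     # linear in f, quadratic in V
    sta = 2.5 * 3.0 * math.exp((v - 0.9) / 0.3)    # exponential in voltage delta
    return dyn, sta

for label, args in [("cherry-picked, 1.25V", (2.8, 1.25)),
                    ("stable, 1.35V",        (2.8, 1.35)),
                    ("with 10% guardbands",  (2.8, 1.35, 1.1, 1.1))]:
    dyn, sta = qc_power(*args)
    print(f"{label}: dynamic {dyn:.2f}W + static {sta:.2f}W = {dyn + sta:.2f}W")
```

It prints the same 51.08W, 65.11W and 94.65W totals worked out above.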
A 2.8GHz QC Opteron versus a 2.33GHz Cloverton would be a slaughter of Intel, about the same as the current performance gap.
Like I said, Intel is getting Excedrin shipped in by the crate.
Pete