Wbmw:
You are the one who is posting BS. Intel has missed schedules plenty of times, yet you think they set schedules and keep them. They have changed schedules a lot. There have been plenty of times where a scheduled item was flat out missed, canceled, or conveniently forgotten, and other times when what was delivered was far less than promised. So those are more errors on your part.
Frankly, with so many errors, your post turns out to be just plain BS based on nothing more than hope and faith. Yet when called on it, you devolve to name calling.
And you claim that AMD can't meet schedules when Intel has the exact same problem. AMD has far more demand for Barcelona than they expected at this point: they planned for tens of thousands, but demand is in the hundreds of thousands. You claim that situation is a good thing for Intel, but not for AMD. That is being a hypocrite.
You set the definition, and when your side has the same thing happen, you back up and claim that the situation is different. Well, it isn't. The plain fact is that it takes 3 months for a demand increase to be satisfied with new supply for either company. Now that Intel can't dictate the amount companies have to buy, they will have more of this happen over time. 18 months ago, Intel had too much supply chasing too few buyers and inventory skyrocketed. And both companies worry that the demand is soft (double ordering or worse), so they may make only enough to cover orders in hand. If demand turns out to be hard, then the shortages will continue.
Pete
Wouter:
That so called Intel advantage comes from changing the schedule many times while their boosters let them get away with it. Didn't Intel just warn that they can't deliver quad core Core 2 CPUs in a timely manner? That they can't meet their commitments? With all their supposed capacity?
Wow, there must be two Intels to have such disparate images. One must be the real one that has problems and warned of shortages, and the other a fantasy one that never screws up: the one that never slipped Itanium launches, never got stuck with RAMBUS, and never had the i820 problems, Prescott slips, the P3-1.13, or FDIV happen.
So since Intel changes schedules so much, they really don't "set a schedule" and keep it. They fail to deliver at all (Tejas, Foxton, etc.) or fail to deliver on time (Prescott, Itanium, etc.). So that supposed "advantage" is just another myth.
Pete
Then how about the IBM submission that also did a single dual socket Clovertown box with an even lower score? Are you suggesting that IBM was "cheating"?
I think taking 8 dual socket Woodcrest boxes and using only one core in each socket is cheating of an even worse kind. If they were forced to state the total core count in the submission (including the supposedly inactive cores, since those "inactive" cores are still doing OS and overhead tasks like communicating between the boxes), they would have to list the 8 dual socket Woodcrest boxes as 32 cores and not 16, and the positioning would change mightily.
Wbmw:
Don't you know how to read these configurations?
Nodes are boxes in the cluster. Processors per node is supposed to be the number of sockets and cores per processor is just what it says it is.
So for there to be 32 cores in the Intel cluster tested, there must be 8 boxes, 2 sockets in each, and two active cores per socket. Clovertown, though, has 4 cores. So either the core and socket counts are reported by software that sees a single socket Clovertown as two dies with two cores each, or there really are two sockets per box with only two cores active in each of two Clovertowns. I suspect it is the former: 8 single socket Clovertown boxes that the software sees as having two dual core processors each. Thus the real competition should be 4 nodes of 2 Clovertown sockets each, but not a single Clovertown submission shows either 4 x 4 x 2 or 4 x 2 x 4.
There is one 3 Vehicle Collision result using a 2S 2.67GHz Xeon 5355 from IBM that takes even longer, 34010, than a 2S 1.9GHz Barcelona from IBM, which takes only 27020. Both show 1 x 2 x 4, which means one box, 2 sockets and 4 cores per socket:
http://www.topcrunch.org/benchmark_details.sfe?query=2&id=543
http://www.topcrunch.org/benchmark_details.sfe?query=2&id=750
Now you don't have a leg to stand on, unless you want to accuse IBM of cheating.
Pete
To quote you, Paul, "No one gives a rat's fundament what *you* will or won't call anything. The rest of us live in the real world, not" Demon Park.
The fact is those who get the data don't want to be snowed by overlapping definitions; then the numbers don't add up. By your lousy definition every PC is a server, including laptops.
Pete
First, I never call anything single socket a server. The question is what Mercury (for your numbers) or Gartner (for mine) calls a server as opposed to a workstation or PC. And do they count a 64 node cluster as one server or 64? You also don't take into account upgrades where the CPU is sold to go into an existing box, like Dan swapping newer Barcelonas in for DC Opterons, or ORNL swapping out all the single core Opterons for dual core ones in their supercomputers, including Red Storm.
1.8 million servers for $6.9 billion is $3,833 per server. That is more than the average 2S server goes for ($2K), but less than a 4S server ($6K). Thus 2.5 CPUs per server is not unreasonable, given the discounts one gets when buying a bunch.
Second, I used Q2 simply as a reference. With sales expanding every quarter, Q2 is higher than last year's Q4. For desktop PC shipments, where the growth is lower, seasonality has more influence on sales.
Third, "all time" is wrong. AMD had less than 10% server share pre Opteron (2003). And during the Q2 final update, the numbers were revised upward by 1% (iSuppli). AMD's server share likely grew in Q3 2007. We will see in about a month when the Q3 reports come out. The earnings reports in two weeks will give us a good clue though.
It's your interpretations that have been shown to be quite false.
Pete
Your numbers are off, since server units means servers, not CPUs.
For Q2 2007, the last quarter for which the information is in, the server market had sales of 2.06 million servers. 1.8 million of those are x86 and AMD64 servers. If the average server has 2.5 CPUs, that's 4.5 million x86 and AMD64 CPUs (heavily weighted to the latter). That is an annual rate of 18 million CPUs. The Q2 server market revenue for AMD64 and x86 servers was $6.9 billion at a 15.5% growth rate. That's an annual rate of $27.6 billion. That compares to the total market of $13.2 billion at 6.3% growth and a $52.8 billion annual rate. The x86 and AMD64 servers had more incremental revenue than the total server market did ($1.07B vs $0.83B).
http://www.itjungle.com/bns/bns082307-story01.html
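As a sanity check on those numbers, here is a quick sketch (the 2.5 CPUs per server multiplier is just the estimate used above, not something Gartner or iSuppli publishes):

```python
# Rough sanity check of the Q2 2007 x86/AMD64 server figures quoted above.
# The 2.5 CPUs-per-server multiplier is an assumption, not a published number.
servers_q2 = 1.8e6          # x86 + AMD64 servers shipped in Q2
revenue_q2 = 6.9e9          # x86 + AMD64 server revenue in Q2 ($)
cpus_per_server = 2.5       # assumed average sockets per server

cpus_q2 = servers_q2 * cpus_per_server
print(f"CPUs shipped in Q2: {cpus_q2/1e6:.1f}M, annual rate: {cpus_q2*4/1e6:.0f}M")
print(f"Average revenue per server: ${revenue_q2/servers_q2:,.0f}")
print(f"Annualized x86/AMD64 server revenue: ${revenue_q2*4/1e9:.1f}B")
```

That reproduces the 4.5 million CPUs per quarter, 18 million per year, roughly $3,833 per server, and $27.6 billion annual run rate.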
Dear Wouter Tinus:
That is a preview on a mockup using an old Broadcom server board. You do realize that the MB has to be designed for DDR2-800, and so does the BIOS. It is also not designed with split power planes, unlike some nVidia 3600 Professional MBs. Besides, they were only trying to mock up a Phenom MB for some preliminary tests.
Phenom will be using unregistered memory, up to DDR2-1066. BTW, many A64s use DDR2-1200 without trouble.
Pete
Dear Tecate:
Again you are wrong, since you can get quad core Opterons at Newegg, but neither Penryn nor Harpertown is available.
So it's a real launch against vaporware.
http://www.newegg.com/Product/Product.aspx?Item=N82E16819105165
And yes, they are in stock.
Pete
Dear Wouter Tinus:
DDR3 support was not listed in table 1 on page 20 (of 365) for rev B. Could you point to where it is stated that Rev B has DDR3 support? It is likely to be in a later rev though.
Just as HT3 DC link speeds are supported in Rev B, but not AC links or unganging features.
I just checked the two Barcelona articles on Anandtech; there is no mention of them trying faster registered DDR2 in the Barcelona systems in either one, neither the Barcelona launch article on September 10 nor the September 18 one against Harpertown.
Pete
Dear J3Pflynn:
Good for you. I don't resort to it either.
Pete
Dear Ephud:
Newegg sells Barcelona, Opteron 2347.
http://www.newegg.com/Product/Product.aspx?Item=N82E16819105165
So where is Penryn? Not available!
Where is Nehalem? Not available!
Just goes to show that people who resort to name calling are immature, wrong, and as such, losers!
Pete
Dear Wouter Tinus:
Registered DDR2-800 is supported in all B revisions. See Table 1 on page 20 of the Bios guide:
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/31116.pdf
FYI, the current Barcelonas are revision BA. The next stepping is B2. Also note that HT3 is supported as well. No current server board has HT3 support though (that might change in the near future), although the links between CPUs might still use cHT3 since those are point to point links.
Pete
Dear Ephud:
Barcelona did not get its clock cleaned. It won many benchmarks that are relevant, and it lost to a pair of unavailable cherry picked CPUs on an unavailable cherry picked platform with unavailable cherry picked memory. Barcelona can use registered ECC DDR2-1066 memory, so why wasn't it used? It's available, unlike 800MHz FBDIMMs. As Dan says, he maxes out the memory, and that is typical for servers, yet the reviewers didn't do that to these servers even though they tested for performance and performance per watt.
Pete
Dear Tecate:
I guess being called on your foolishness puts you into a tiff!
Pete
Dear Tecate:
Should we all call Intel quads duct taped bicycles? A bicycle with training wheels? Dual do overs? Thought so. Then use the proper names.
Name calling is the last bastion of losers.
Pete
Dear Thread:
Registered ECC DDR2 PC2-6400 has shown up. So socket F Opterons can now get 12.8GB/s of memory bandwidth per socket. Look for Intel servers to fall further behind.
http://www.memory.com/DDR2800240R.asp
Also I saw Grindhouse today and the Death Proof credits have thanks going to AMD and Hector Ruiz right on the first line of the thanks section. It was partly filmed in Austin, Texas.
Pete
Dear Dave:
Sorry had to restart opera. Fixed.
Pete
Dear Dave:
I haven't been able to get into SI since 11:45PM Saturday.
I even attempted to reach www.siliconinvestor.com from the IHub home page, and still no go. It waits forever for a connection. Of course, I haven't had any such problems with IHub.
Pete
PS I am coming from sbcglobal.net.
Wbmw:
Well, AMD has lower power chips too, even ones that run under 1W. First there are the Geodes, which are SOCs that run at 333MHz and 433MHz at 0.9W, as is the 1.1W 533MHz Geode. Then you have the rebadged Athlon XPs: the 1.5W Geode NX at 667MHz and 1GHz, followed by the 6W 1.4GHz NX 1750. They all run x86 software, which is all Core Duo can run anyway.
Second, Turions already have a 25W MT line in addition to their 35W ML and 37W MK lines. Where have you been? And Turion X2s at lower P-states (lower voltage and frequency) get well under 17W. Heck, at 1GHz they are under 9W. And all of these do AMD64, which Core Duo can't. These figures also include the northbridge functions and the memory controller, which aren't found on any Intel CPU. Add those (the FSB and memory controller from the NB) to the Intel CPUs and you get quite a bit more than 9, 15 and 17W, more like 14, 19 and 21W. Then you should also take into account the different measurement standards, and I wouldn't be surprised if the L7400 Core 2 Duo is over 25W TDPmax.
Most who look at such things notice that Core 2 Duos just don't have an idle power as low as Turion X2s, and that compares 65nm CPUs to 90nm ones. Look for the 65nm SOI Turions and X2s to do quite a bit better at lowering power than Core 2 Duo. Core Duo's demand is going to shrink fast due to its inability to run any AMD64 software, and it will have to shrink its power demands quite a bit before joining the Geodes in the embedded markets.
Fourth, you'd have to assume that Intel is sitting still to believe that Turion will surpass them. On the contrary, Intel already has another bin or two planned this year, and in addition they will have 45nm parts by this time next year.
Why? You always assumed that AMD was standing still, and they haven't been. Now that Intel hasn't been able to raise clock speeds over the last few years, you assume they can magically do so.
Re: Barcelona derived mobiles will do even better when each core can have its own voltage.
And this is just speculation, since nothing is known at this point about Barcelona's power profile at mobile wattage levels. Your assumptions are obviously wishful thinking.
That goes against what AMD has stated. Each core has its own multiplier and thus its own frequency. And Barcelona derived parts do not have to include the second FPU set for mobile purposes, although they will likely retain the 128 bit SSE load/store units, especially since integrated R580 and R600 class ATI GPUs can do quite a bit of FPU work. That is something Intel isn't even rumored to have until 2012 at the earliest. They could use nVidia IGPs, but they don't seem inclined to allow it.
Thus your assumptions are wishful thinking. More like whistling while walking into a large mine field. "Click!"
Pete
Dear J3pflynn:
Going from 65W to 45W on the desktop means a reduction of about 30% in power (65W * 70% = 45.5W). So single core Turions would go down to 25W (35W * 70% = 24.5W) and 17W (25W * 70% = 17.5W), which is below Core Solo. In addition, Turion X2 would go to 25W (35W * 70% = 24.5W). This is great for laptops, since we all know that power reduction is prized there even more than performance improvement. Static leakage probably went down as well, so K8G will be even better at idle. Hours on the same size battery are likely to go up 20-30% (the other components won't come down as fast). Laptops will go from 3-5 hours of run time to 4-6.5 hours with the same performance.
Barcelona derived mobiles will do even better when each core can have its own voltage. Look for the power reductions to reach 40-50% at the same performance. That means single core Turions would go from 35W and 25W to 17-21W and 13-15W respectively. Dual core Turions would drop to 17-21W, with laptop run times increasing to 4.5-7.5 hours. Add high performance ATI chipsets and integrated GPUs and Intel will be in a world of hurt.
And that is before AMD's increase in capacity is figured in.
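For anyone checking the math, here is the scaling above as a small sketch (the 30% and 40-50% reductions are this post's assumptions, not AMD roadmap figures):

```python
# Power scaling argued above; the reduction percentages are assumptions
# from this post, not published AMD numbers.
def scale(tdp_watts, reduction):
    return tdp_watts * (1 - reduction)

for tdp in (65, 35, 25):
    print(f"{tdp}W at -30%: {scale(tdp, 0.30):.1f}W")

for tdp in (35, 25):
    lo, hi = scale(tdp, 0.50), scale(tdp, 0.40)
    print(f"{tdp}W at -40% to -50%: {lo:.1f}W to {hi:.1f}W")
```

That gives 45.5W, 24.5W and 17.5W at the 30% step, and roughly 17-21W and 12-15W at the 40-50% step, matching the ranges above.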
Pete
Dear Kpf:
I have gotten Linux to install on A64 X2 laptops. My brother has it running on a Turion X2 TL-52 laptop. I have also installed it on a couple of HP A64 17" laptops (I forget the exact model number, but it's the one with the numeric keypad the accountants who use them wanted). Gentoo AMD64 versions are on all of them. My Dad likes Slackware and has 11.2 installed on his. He has a little trouble with speakup and the keyboard; one of his Linux system administration students needs it because he is blind.
Pete
Dear Tench:
I use Linux most of the time. Disk data is cached in unallocated memory. Much of the time the disk is not accessed at all; things are read straight from the cache and written there. Dirty cache pages are slowly written back to disk over time, usually within 30 seconds. You can sync the disks at any time by flushing the dirty pages back.
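For what it's worth, here is a minimal Python sketch of forcing those dirty pages out yourself rather than waiting for the periodic writeback (the file name is just an example):

```python
import os

# Write some data; it initially lands in the page cache, not on the platter.
with open("/tmp/example.dat", "wb") as f:
    f.write(b"x" * 1024 * 1024)
    f.flush()              # push Python's buffer into the kernel page cache
    os.fsync(f.fileno())   # force this one file's dirty pages to disk

os.sync()                  # flush all remaining dirty pages, like the sync command
```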
With memory sizes averaging above 1GB, the amount cached is about equal to a CD. It is interesting that many 1GB flash drives appear to write nearly instantly, but take a minute or two before you are allowed to remove them from the port (umount, in Linux parlance).
Another way to speed up disk access is to mirror your data drives. Since both copies have the same data, you can read from either one, so two 7200 rpm drives look like one 14.4K rpm drive for reads. And since Linux does software RAID 0, 1, 5 and any combination of them, you get the benefits even without a RAID controller, and the drives don't need to be identical in size, type or speed.
One trend is also helping: increasing disk block sizes. It reduces the overhead of disk caching and of the file systems used. 512 bytes was a lot when you had a few dozen KB of memory; it's tiny with GBs of memory. 4KB is the new size, and 2MB isn't that far away (one more order of magnitude of disk interface speed). Those coincidentally match AMD64 page sizes.
As for networks, the same is true for larger packet sizes. Most routers and switches are bottlenecked on packet count, not bandwidth. A 10Gb link can carry over 23 million 53B ATM cells per second, but only about 800K 1.5KB TCP/IP packets (the maximum size allowed by the standard). Most software based routers, including the one in your PC, only handle about 500K packets per second.
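A quick check of those packet rates (raw line rate only, ignoring inter-frame gaps and framing overhead):

```python
# Packets per second a 10Gb/s link can carry at different frame sizes,
# counting only the raw bits in each frame.
link_bps = 10e9
for name, size_bytes in (("53B ATM cell", 53), ("1500B TCP/IP frame", 1500)):
    pps = link_bps / (size_bytes * 8)
    print(f"{name}: {pps/1e6:.1f} million packets/s")
```

That comes out to about 23.6 million small cells per second versus roughly 0.8 million full size frames.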
Web serving is limited by the slowest link between you and the web server, and congestion can make a fast link slow. Most of the time the bottleneck is the net and not you or the web server. For multimedia serving, however, the limit is usually the "last mile" to the home. I currently have 6Mb/768Kb ADSL. That works out to about 660KB/s down and 85KB/s up. Getting FTP downloads from other Linux boxes, I get close to the maximum, 600+KB/s. Video comes in at about 50 to 200KB/s depending on the type. Web surfing peaks at about 100KB/s, but averages below 10KB/s.
I could live with a 1.5Mb/256Kb ADSL line or cable and do most of what I want. If the FCC and Congress were more diligent, most of us would be at 10BT speeds (10Mb/10Mb). Fiber to the home gets you to 1Gb/s and up (eventually beyond 1Tb/s). Currently the infrastructure could handle 10BT speeds for everyone; sustained throughput would not be much more than 40Mb/s down and 500Kb/s up even with fiber to the home. You don't watch more than two HDTV channels on average or send a 1 hour home video more than once a day, and phone conversations, online game playing and web serving don't take much compared to those.
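The ADSL numbers above work out roughly like this (the ~12% framing overhead is my assumption, picked to get from the raw line rate to the observed 660KB/s and 85KB/s):

```python
# Usable throughput of a 6Mb/768Kb ADSL line. The 12% overhead for ATM/PPP
# framing is an assumed figure, not a measured one.
overhead = 0.12
down_bps, up_bps = 6e6, 768e3
print(f"down: {down_bps*(1-overhead)/8/1e3:.0f} KB/s, "
      f"up: {up_bps*(1-overhead)/8/1e3:.0f} KB/s")
```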
Pete
Tench:
There are applications that frequently and/or randomly access data in the higher levels of the register/cache/memory/disk hierarchy. If they don't go beyond L1, they are light weight tasks. If they don't go beyond the total cache, they are medium weight tasks. If they go to memory, they are heavy weight tasks, and if to disk, they become extreme weight tasks.
Just because 90% of accesses are satisfied by the L1 cache doesn't mean a task isn't heavy weight. If a task accesses a set of 256 random locations spread across 64KB of memory that maps to just a few L2 sets, the chances are low that any particular access is already in cache, so most of those accesses refill a cache line from DRAM. If those are 5% of all accesses, the average memory access will be expensive, say 50 cycles, and little processing will get done as the average instruction takes 3.5 cycles instead of less than 1. This would be classed as a heavy weight task even though 64KB should fit in the L2 cache.
Granted, that is an extreme example, but you would be surprised how easy it is to write programs like that. Just a tiny amount of care turns cache unfriendly programs into medium weight, cache friendly ones. With some harder work you can turn most of them into light weight programs, which run substantially faster than medium weight ones.
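The arithmetic behind that example, as a small sketch (the 5% miss rate and 50 cycle penalty are the illustrative numbers above, not measurements):

```python
# Average cycles per instruction when a fraction of instructions miss cache
# and have to wait on DRAM. All figures are illustrative.
base_cpi = 1.0        # roughly 1 cycle/instruction when everything hits cache
miss_rate = 0.05      # fraction of instructions that end up going to DRAM
miss_penalty = 50     # cycles per DRAM access

avg_cpi = base_cpi + miss_rate * miss_penalty
print(f"average cycles per instruction: {avg_cpi:.1f}")   # about 3.5
```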
You can see why most encoding programs fit mostly in L1: the memory needs beyond 32KB of data are quite predictable. The program knows which block of memory will be needed in the near future, so it can start writing the previously finished block back to memory and loading the new one into L2. By the time the current block is finished, the next block is already in cache. So even though L2 and memory are used, their use is quite predictable. The same is true for the disk, which is mostly accessed as frames or groups of frames rather than random blocks. So by the above definitions, encoding is a light weight task.
Most games are medium weight tasks. Their code extends over much more than 32KB of instructions and their data over much more than a few MB, but most of that is accessed periodically over time; only a few hundred KB is needed at any given moment. They are usually designed for the smallest cache CPUs in common use, currently about 256KB. Typical access patterns are lists and tables, which are predictable to some extent, but not as much as encoding. Thus these are medium weight tasks. More than one of these tasks per core makes them look like a heavy weight task, as they tend to stomp on each other, mostly because their access patterns hit the same cache sets. More cache ways help keep them from doing that.
Servers have many of these medium weight tasks, often more than a few per core, and it's good that their access patterns are so varied, since more of them can be carried without too much stomping on each other. But plenty of cache stomping still occurs, so the server approximates what a few heavy weight tasks per core would do. Missing in L1 would still have a great impact on performance, since L1 is still a 4-10 times booster, but total cache size does not matter as much. These loads like large amounts of low latency memory capable of many simultaneous reads and writes.
Pete
Dear Alan81:
If Sun did so well on Art (+203%), then people should complain about how Intel did so well on Sixtrack (+156%) and Mesa (+137%). Sun did better on Swim (+259%) than on Art. Sun's Opteron 2218 wins 10 benchmarks to the Dell Cloverton X5355's 4. Dell could always use Sun's Solaris and Studio for their submissions instead of MSFT x64 XP and Intel's icc and ifc.
Intel uses tweaks to the way it handles flags for SPEC. That's why base equals peak, unlike the behavior of any other compiler. It qualifies as doing an iron cross while trying to cross your eyes.
Quit whining about what Sun did. AMD has operated at a compiler disadvantage almost all its life. Now that the shoe is on the other foot, all this bellyaching comes to the fore. Face it: the Opteron 2218 beat Cloverton at a lower clock (2.6 vs 2.67GHz) and with half the cores on 10 of the 14 SPECfp_rate2000 benchmarks.
Pete
Tenchu:
Weight in this discussion is about the amount of cache required to hold the working set. Light benchmarks have working sets that fit into L1. Medium benchmarks have working sets that only fit into L2 or L3. Heavy benchmarks have working sets that don't fit into L1, L2 or L3; they run mostly out of main memory.
DivX encoding is a light weight benchmark. It has a small cache working set that fits into L1 on most modern CPUs. That's easy to see, since a larger L2 cache doesn't affect its performance much. It doesn't mean it won't eat a lot of cycles doing its work.
Another light weight benchmark is computing MD5 sums for every file on your system. It also takes a long time, but that is because it takes a long time to read all the files. Ditto for a backup program that does software compression to reduce the amount of media consumed.
DivX encoding is also done in hardware. If you do a lot of it, it is more cost effective to get a DivX encoding accelerator built specifically for that and similar tasks. It doesn't use much power and completes the job without much CPU intervention.
You follow Intel's lead and restrict the scope to light and medium weight tasks to show how good their CPU is. They don't want to show heavy weight tasks where their CPU does poorly (who could blame them?). You twist away from any benchmark that shows otherwise, and if you can't twist away, you pooh pooh it as not relevant.
Very few benchmarks show little change when you shut off L1, L2 and L3 (if present). I don't know of many featherweight tasks that would fit that bill. And I don't diss benchmarks as long as they are used responsibly. Using DivX encoding to claim that one system is better than another over all workloads isn't responsible use.
Pete
Wbmw:
TPC-C is a disk I/O bottlenecked benchmark; memory just helps lower the apparent disk latency. To get higher scores, TPC-C requires increasing the size of the data set, which leads to a lot of disk relative to processing power. Normal database configurations keep about 10% of the data set in memory. 300-400 73GB disks would need about 3TB of memory, and no benchmarked system in the 200-400K score range has that much. Systems with 64-128GB of memory would be used against a 500GB-1TB data set.
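Rough sizing, using the 10% rule of thumb above (300 and 400 disks are just the ends of the range mentioned):

```python
# Memory needed to cache ~10% of a TPC-C data set spread over 73GB disks.
# The 10% figure is the rule of thumb from this post.
disk_gb = 73
for disks in (300, 400):
    data_tb = disks * disk_gb / 1000
    print(f"{disks} disks: ~{data_tb:.1f}TB data set, "
          f"~{data_tb*0.10:.1f}TB of memory at 10%")
```

That gives roughly 2.2TB to 2.9TB of memory, which is where the "about 3TB" above comes from.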
Just having terabytes of disk says nothing about the size of the working set. A simple backup of that data set doesn't take much of a working set; it's just copying from one place to another, and it just takes a lot of time.
So get a clue.
SAP benchmarks its SD (Sales and Distribution) application for the SAP scores. This is more heavy duty than TPC-C above. The problem is that costs aren't disclosed, which prevents apples to apples comparisons.
As for my knowledge of servers, it covers large VAR applications like warehousing, ERP, construction and telecom. These applications are composed of hundreds of programs and have millions of lines of code. In normal use, thousands of processes are active at any given time with hundreds of users.
The cache working sets were quite large, over 10-100MB on most of these. Increasing cache size doesn't help performance much; adding memory did. Memory working sets were in the 8-80GB range in a lot of these. Get below that working set size and the systems slowed to a crawl. Faster (not larger) disk also helped performance once memory was sufficient to contain the maximum working set.
These are the kinds of workloads customers use to test a new server. They typically hit the test system harder than their heaviest days and see what the response time is. If it isn't up to what they already have, the server is rejected. They don't accept what TPC-C or SAP says about a server; only their own testing matters.
If you think otherwise, leave server benchmarking to those who develop, configure and build such systems.
Pete
Elmer Phud:
On large workloads, a 2 socket Opteron 2218 beats a 2 socket Cloverton X5355 by 12.5% (SPECfp_rate2000). Why pay upwards of $2600 for Cloverton X5355s when Opteron 2218s at less than $1400 do better?
Pete
Elmer Phud:
Base scores are all that are submitted by Fujitsu.
Fujitsu submits peak scores for its Opteron servers.
http://www.spec.org/cpu2000/results/res2006q1/cpu2000-20051223-05324.html
Perhaps base matches peak only for Intel compiled code on Intel CPUs. That is not true for any compiler targeting Opteron. So not using peak scores is very relevant.
You just don't like to use them because it makes Intel look worse.
Pete
Wbmw:
I showed that even against itself, one die versus two dies per socket, SPECint_rate2000 has a working set that fits in cache and SPECfp_rate2000 doesn't. And when it doesn't fit, performance suffers greatly.
TPC-C is not a large working set benchmark; it is mostly a memory and disk size benchmark. SAP does have a larger working set, but I don't know it that well. And Opteron isn't being "killed" in SAP: Tulsa needs 16MB of cache and IBM's better chipset to get ahead. The best Opteron score I found is this one from Fujitsu (8 way, 217330, http://www.sap.com/solutions/benchmark/pdf/cert4006.pdf ). The best non IBM Xeon score (so that the Hurricane chipset is not used) is this one from HP (4 way, 213000, http://www.sap.com/solutions/benchmark/pdf/cert6006.pdf ). The Opteron system uses 2.4GHz 880s with 2x1MB L2, and the Tulsa system uses 3.3GHz 7140Ns with 2x1MB L2 and 16MB L3.
The big problem with these is that prices are not included, so price/performance comparisons can't be done. An Opteron 880 goes for $1099 while a Xeon 7140N goes for $2252. The MB prices will likely be similar, because the 4 socket board has to include a Xeon MP chipset while 8 socket Opteron 940 MBs have been around a while (I think Fujitsu uses the Iwill 8 socket MB).
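Since the certifications don't list system prices, here is a rough CPU-only comparison using the list prices above (a partial picture, since boards, memory and the rest of the system are left out):

```python
# CPU-only price/performance for the two SAP SD results cited above.
# Only the processor list prices are included, so this is not a full
# system comparison.
systems = {
    "8x Opteron 880 (2.4GHz)": {"cpus": 8, "cpu_price": 1099, "score": 217330},
    "4x Xeon 7140N (3.3GHz)":  {"cpus": 4, "cpu_price": 2252, "score": 213000},
}
for name, s in systems.items():
    cpu_cost = s["cpus"] * s["cpu_price"]
    print(f"{name}: CPU cost ${cpu_cost:,}, "
          f"score per $1K of CPU = {s['score']/cpu_cost*1000:,.0f}")
```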
As for the working sets of some benchmarks, you can see them here: http://archvlsi.ics.forth.gr/html_papers/TR192/node2.html
Be careful, I use working set here to denote cache usage, not memory usage.
Here is an OLTP look at cache usage:
http://www.cs.wisc.edu/~mscalar/papers/2006/isca2006-coop-caching.pdf
It even compares all independent caches, SRQ like cooperative caches and shared caches like C2D.
Pete
Elmer Phud:
Continuing to use base scores is a bigger lie. I showed that where working sets increase, quad core does worse even against dual cores using the exact same core. I also showed that SPECint_rate2000 uses too small a working set to be relevant for server workloads. SPECfp_rate2000 has a larger working set and was shown to use the FSB a lot more. Where are the quad core SPEC CPU 2006 scores?
Pete
Dear Mas:
Using the larger working sets of SPECfp_rate2000, 2 Opteron 2218s beat 2 Cloverton X5355s by 12.5% (117 vs 104), and 2 1.86GHz Woodcrest 5120s match a 2.67GHz Kentsfield QX6700 (65.0 to 65.0). That means a $1122 Kentsfield QX6700 is matched by two $210 Woodcrest 5120s, and two $1320 Cloverton X5355s are beaten by two $579 Opteron 2218s. Even if you include the MB and 8 1GB PC2-3200 registered ECC DIMMs, the Opteron 2218 setup still only costs $2378 vs $2640 for just the Clovertons.
I suspect the reason SPEC CPU 2006 results weren't submitted is that both the int_rate and fp_rate scores drop dramatically. Server workloads and working sets are even bigger and would likely show further quad core slippage.
Pete
Elmer Phud:
Use SPECfp_rate2000 because of its larger working sets. Two 1.86GHz@1066FSB Woodcrests matched one 2.67GHz@1066FSB Kentsfield (65.0 to 65.0). Those are the same cores, but with one FSB shared by 2 cores in the Woodcrest pair versus one FSB shared by 4 cores in Kentsfield. The clock had to go up 44% to get the same score. That is the FSB penalty.
Two 2.6GHz Opteron 2218s beat two 2.67GHz Cloverton X5355s by 12.5% (117 to 104). That's right: 2 Opterons beat 2 Clovertons with a lower clock and half the cores, and the Clovertons are made of cores that do 4 DP flops per cycle versus 2 DP flops per cycle for the Opteron cores. The FSB is wiping out a theoretical advantage of more than 4 times.
And the working sets of the workloads server customers test with are much larger than that of SPECfp_rate2000. That is the target market of Cloverton.
Using SPECint_rate2000, 2 2.67GHz Woodcrests beat 1 2.67GHz Kentsfield by only 3% (112 to 109), and much of that is likely the FSB upgrade to 1333MHz. The two 1.86GHz Woodcrests did 82. If we scale that score up by the clock speed ratio, we get 118; using the typical clock speed scaling of SPECint_rate, that drops to about 110. That shows that SPECint_rate2000's working set fits in cache while SPECfp_rate2000's doesn't.
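The clock scaling arithmetic, for anyone checking:

```python
# Scale the 2x 1.86GHz Woodcrest SPECint_rate2000 score up to 2.67GHz
# assuming linear scaling with clock, to compare with the measured
# 2.67GHz results quoted above.
score_186 = 82.0
ratio = 2.67 / 1.86
print(f"clock ratio: {ratio:.2f}x ({(ratio-1)*100:.0f}% higher clock)")
print(f"1.86GHz score scaled linearly to 2.67GHz: {score_186*ratio:.0f}")
```

That gives a 44% clock difference and a linearly scaled score of about 118, against the 112 actually measured at 2.67GHz.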
I see that no SPEC CPU 2006 results were submitted for the Kentsfield or Cloverton. Given that their working sets are larger, I can see why.
Pete
DrJohn:
Thanks for proving that you don't get it. Intel's quad cores do poorly when the data set gets larger. Far from being the FPU monsters originally claimed, they lost against the same number of K8 cores. In one case, 2 cheaper, lower clocked Opteron 2218s (2.6GHz) were more than 12% faster than 2 top end, super expensive (2.67GHz) Cloverton X5355s (117 vs 104 SPECfp_rate2000). And they fell further behind as more sockets were tested.
Core 2 can do 4 DP flops per cycle, yet Opteron, with only 2 DP flops per cycle, beats it every time, in some cases with half the cores. Clearly there is a big bottleneck somewhere else, and that somewhere else is mainly the FSB.
Two Woodcrests clocked at 1.86GHz@1066FSB do as well as one 2.67GHz@1066FSB Kentsfield (65.0 vs 65.0). What is the major difference? The Woodcrests are on separate FSBs while the Kentsfield has only one. On SPECint_rate2000, 2 2.67GHz Woodcrests beat 1 2.67GHz Kentsfield 112 to 109 (the 1.86GHz Woodcrests above did 82), which goes to show how much smaller the working set of SPECint_rate2000 is compared to that of SPECfp_rate2000.
Server customers aren't as gullible as PC customers either. They test servers using their own applications and workloads, whose working sets tend to be even larger than that of SPECfp_rate2000. Quad core Kentsfields and Clovertons may do well on the benchmarks used by reviewers, but they do poorly on the tests done by server customers.
You should take notice that SPEC CPU 2006 results weren't submitted for Cloverton and Kentsfield. Both the int and fp suites of CPU 2006 have larger working sets than their CPU 2000 counterparts.
Pete
Elmer Phud:
Perhaps you better look again and use peak scores this time.
SPECint_rate2000:
The best QX6700 score is 109 (4x1x4).
The best 2220SE score is 90.3 (4x2x2).
The best 856 score is 90.5 (4x4x1).
The best X5355 score is 200 (8x2x4).
The best 8220SE score is 175 (8x4x2).
The best 885 score is 279 (16x8x2).
SPECfp_rate2000:
The best QX6700 score is 65.0 (4x1x4).
The best 2220SE score is 119.0 (4x2x2).
The best 856 score is 106 (4x4x1).
The best X5355 score is 104 (8x2x4).
The best 8220SE score is 182 (8x4x2).
The best 885 score is 231 (16x8x2).
Given the above, in SPECint_rate2000, Kentsfield QX6700 is only 21% faster than 2 Opteron 2220SEs. 2 Cloverton X5355s are only 14% faster than 4 Opteron 8220SEs. 8 Opteron 885s are 40% faster than 2 Cloverton X5355s.
In SPECfp_rate2000, Kentsfield QX6700 is only 55% as fast as 2 Opteron 2220SEs. The QX6700 is only 91% as fast as even 2 Opteron 2210s. 2 Cloverton X5355s are only 57% as fast as 4 Opteron 8220SEs. 2 Cloverton X5355s are only 87% as fast as 2 Opteron 2220SEs. 8 Opteron 885s are 122% faster than 2 Cloverton X5355s.
So for SPECint_rate2000, Intel quad cores are only 14-21% faster on a per core basis, but lose big being only 55-57% as fast (Opterons are 75-82% faster) in SPECfp_rate2000 on the same per core basis. What is telling is that 2 top end Clovertons can't keep up with 2 lower clocked dual core Opteron 2218s in fp_rate. That must hurt.
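For reference, here are the ratios behind those percentages, computed straight from the peak scores listed above:

```python
# Ratios behind the percentages quoted above, taken from the peak scores
# listed earlier in this post.
int_rate = {"QX6700": 109, "2x 2220SE": 90.3, "2x X5355": 200,
            "4x 8220SE": 175, "8x 885": 279}
fp_rate  = {"QX6700": 65.0, "2x 2220SE": 119.0, "2x X5355": 104,
            "4x 8220SE": 182, "8x 885": 231}

print(f"int: QX6700 vs 2x 2220SE:   {int_rate['QX6700']/int_rate['2x 2220SE']:.2f}x")
print(f"int: 2x X5355 vs 4x 8220SE: {int_rate['2x X5355']/int_rate['4x 8220SE']:.2f}x")
print(f"int: 8x 885 vs 2x X5355:    {int_rate['8x 885']/int_rate['2x X5355']:.2f}x")
print(f"fp:  QX6700 vs 2x 2220SE:   {fp_rate['QX6700']/fp_rate['2x 2220SE']:.2f}x")
print(f"fp:  2x X5355 vs 4x 8220SE: {fp_rate['2x X5355']/fp_rate['4x 8220SE']:.2f}x")
print(f"fp:  8x 885 vs 2x X5355:    {fp_rate['8x 885']/fp_rate['2x X5355']:.2f}x")
```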
Wonder what SPEC CPU 2006 will say?
Pete
Elmer Phud:
And many Opteron 2 socket systems wipe the floor with QX6700 Kentsfield in SPECfp_rate2000:
http://www.spec.org/cpu2000/results/res2006q4/cpu2000-20061030-07857.html
http://www.spec.org/cpu2000/results/res2006q4/cpu2000-20061016-07641.html
92.1 (Opteron 2220 SE) to 65.0 (QX6700). The slowest socket F Opteron pair (2210s) is 10% faster than the fastest Kentsfield, 71.4 versus 65.0, even though they have less than 68% of the clock.
http://www.spec.org/osg/cpu2000/results/res2006q3/cpu2000-20060721-06621.html
A pair of 1.86GHz@1066FSB Woodcrest 5120s is as fast.
http://www.spec.org/osg/cpu2000/results/res2006q3/cpu2000-20060626-06261.html
I call a Kentsfield QX6700 getting beaten by a mere pair of Opteron 2210s getting trampled. A $1300 CPU getting trounced by a setup of 2 2210s, an MB and 4x1GB registered DDR2 PC2-3200 DIMMs costing less than $1200. That FSB rears its ugly head again.
Wonder what SPEC CPU 2006 will show?
Pete
Dear Smallpops:
There are two other coprocessor types, in addition to the current FPGAs, that they will be able to put in Opteron sockets in the XT4. The first is a vector processor derived from the X1E. The other is a multithreaded processor (the XMT). This is the press release from Cray:
http://investors.cray.com/phoenix.zhtml?c=98390&p=irol-newsArticle&ID=930804&highlight=
Here is more on XT4 and XMT:
http://www.cray.com/products/xt4/index.html
http://www.cray.com/products/xmt/index.html
You better read these before you say something stupid.
Pete
Dear Smallpops:
Evidently you didn't read the press release. XT4 is based on Socket F and uses the SeaStar2 interconnect, while XT3 used SeaStar. XT4 also uses Torrenza, as Opteron sockets can be filled with FPGAs, vector processors and multithreaded processors. You can mix and match as you need, up to 40K sockets.
I think that deserves a model number change. You don't complain about all the inconsequential Intel model number changes, and those have far less differentiation. They haven't gotten off the FSB since the 8086 days. Now that's a joke.
Pete
Dear Herb:
These complainers are probably the same ones who, when asked for orders three months out, only ordered 30-50% of what they really thought they would sell. Who knows why? Perhaps they thought, like some here, that CMW would do better than it did. Perhaps they thought that if they needed some they could get them from someone else.
Even if AMD is filling every order by the date promised, these low ballers would complain. Would they admit they didn't order enough? No. But they complain when others get their previously ordered product and have plenty to sell, while they have sold out what they ordered and have none. If AMD had more orders, they would have had Chartered make enough to cover them while they ramp Fab 36 and transition to 65nm. They probably even made more than they promised and shipped it to customers whose promised dates were further in the future. Most dealers would simply take the parts and have even more to sell; a few will refuse delivery and ask that it arrive at the scheduled time (JIT OEMs especially).
What will likely happen is that AMD will ramp past real demand and there will be a return to loose stock in the channel for those who under ordered. Q1 is likely to be better than seasonal, just like it was this year. If Q2 is better than Q1, then we know AMD still hasn't caught up to real demand, that is, the demand of the end users who will go to many different dealers until they find the CPU they are looking for. The different dealers all want to make the sale, so they all place orders. This aggregate demand can be higher than the real demand that finally shows up when a product stops being a hot seller.
Pete
Dear Mmoy:
Yes, multithreaded programming is a bear. Those who write device drivers and systems code have to do it a lot, like what I am doing now. Even once you get good at it, and it takes years to develop those skills, it can still frustrate you from time to time.
Still, the question of what multicore is for on the desktop must consider the environment it operates in and the requirements of the user.
First, the easiest environment to take advantage of multiple cores is one that has many active programs running. If you have more programs or tasks than you have cores, you fall into the normal well solved problem of who does what.
Second, power users also have this tendency to have many active tasks or tasks that are inherently parallel.
Third, you have inherently parallel tasks like rendering and transcoding. These, though, can usually be offloaded to specialized hardware that does the job more efficiently. Rendering is better done by a GPU, which is optimized for such a task. Many video capture boards have DSP hardware that does the heavy lifting of transcoding, leaving the CPU to do the control and I/O work it is good at. Spending $1000 on a general purpose CPU to do this when a $100 add in board does it ten times as fast is penny wise but dollar foolish.
Fourth comes the average user, who may do one active thing but a few things in the background like DVD burning or downloading. This is fine for dual core, but doesn't need more than that.
Fifth is the hardest: the user who does only one thing at a time and finishes it before starting the next. Unless that one thing itself does many things, it doesn't need or use more than one core.
At least most newer multiple core CPUs now will either reduce the power to the other cores, slow or halt those other cores or better both, when only one core is needed.
Quad core may make sense for the power users among us, like me, or for servers. For most users, dual core is the most they need. The occasional user just needs one core.
In the long term future we may all be power users, with dozens of software agents doing all of the mundane tasks we don't want to be bothered with (looking up the news, finding rumors about our favorite sports teams or gossip in our various groups, paying the bills, making sure the kids do their homework, sending cards to everyone on their significant days like birthdays, etc.). Then we may all want as many cores as we can afford. But it will be at least a generation (10-20 years) before the majority gets to that point.
Pete