Dear Jackthex:
Intel cannot use flash to subsidize CPU sales, mainly for three large reasons. First, when they tried to increase flash prices (15% IIRC), they lost half of their flash sales, dropping them further into the red. Spansion would have to follow suit on price, and the extra revenue would flow to Spansion and, through minority interest, to AMD. Second, Intel has much higher ASPs for CPUs than AMD ($150 vs $86 IIRC), and flash has lower gross margins to boot. Third, and it's a killer, Intel's IAG revenues are 12-13 times the size of their flash revenues; flash would have to grow exponentially to have much of an effect, and that is definitely not in the cards.
These are the same reasons why Intel can dump flash at slightly above marginal cost, but far below breakeven, and still hurt AMD. Over time this hurts GMs, and sooner rather than later, investors will start screaming at Intel to desist.
Pete
Dear Wbmw:
IA32 has a 4GB address space. Windows treats addresses with the upper bit set as OS space and those with it clear as application space, dividing the address space in two. Win64 has a 1TB physical space and a 256TB virtual space. The MSB being 1 gives 128TB of the virtual space to Windows and 128TB to applications. Not many systems have 256TB of memory, so a nested set of tables (4 levels in A64) converts virtual addresses to physical ones. Thus only ~200MB of physical memory, plus any I/O and device memory, need be mapped into that upper 128TB of OS virtual address space, while the rest of memory maps into the lower 128TB for applications.
You will likely only see a need for more space when main memories approach 128TB, something only huge supercomputers come near yet. 32 bit apps on Win64 can use most of their 4GB of space for their own purposes, as the OS API needs just a few MB of it (not much is needed because the OS sees all the physical and virtual memory).
This is why Linus Torvalds and many other OS designers and programmers said that address spaces larger than 4GB were needed even before memory was above 1GB. Many here said it was not needed, but they didn't have to partition, map and assign memory spaces. Abstraction further fragments memory spaces, and as the number of levels of abstraction continues to rise, the ratio between physical memory and virtual address space grows much larger. The first example requires a ratio of 2. A single device abstraction layer may push the ratio to 4. It doesn't take much for VMs to push ratios to 8 or more, and a 4GB space can be overrun even with the 512MB of memory most PCs now come standard with.
x86-64 will allow this even as memory sizes grow into double digit TBs.
I hope this helps Wbmw.
Pete
PS: The 4 levels in x86-64 use 4KB table pages, each filled with 512 entries holding 40 bit physical addresses or lower level table pointers plus flags. 512 4KB pages are covered by one entry-page at the next level up, spanning 2MB. 512 2MB spans make 1GB, 512 1GB spans make 512GB, and 512 512GB spans make the 256TB address space. There is a mode that skips the first level and maps 2MB pages directly, reducing the amount of memory this multilevel map takes. 8GB of memory needs only 1 page in level 4, 1 page in level 3, 8 pages in level 2 and 4K pages in level 1. That uses 16MB to map/assign 8GB. With only the upper three levels, it uses 40KB and a lot fewer TLB entries in the CPU. Adding 1 page in level 4 and 1 in level 3 adds the OS to the map, which adds only another 8KB and allows the OS to live in a separate virtual address space.
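If you want to check that arithmetic, here's a quick back-of-the-envelope in Python. The 512-entries-per-4KB-table figures are from the AMD64 layout described above; the little counting function is my own sketch, not anything out of AMD's docs:

    import math

    ENTRIES = 512        # 8-byte entries per 4KB table page
    TABLE = 4 * 1024     # each table level is built from 4KB pages

    def map_bytes(mem_bytes, levels, leaf_page):
        # Count the 4KB table pages a `levels`-deep map needs to cover
        # mem_bytes, where the bottom level points at leaf_page pages.
        entries = mem_bytes // leaf_page
        total = 0
        for _ in range(levels):
            tables = math.ceil(entries / ENTRIES)
            total += tables
            entries = tables          # next level up indexes these tables
        return total * TABLE

    mem = 8 * 2**30                                    # 8GB of memory
    print(map_bytes(mem, 4, 4 * 1024) / 2**20, "MB")   # ~16MB with 4KB pages
    print(map_bytes(mem, 3, 2 * 2**20) / 2**10, "KB")  # 40KB with 2MB pages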
Dear Wbmw:
Opteron BL35 supports 8GB. http://h18004.www1.hp.com/products/servers/proliant-bl/p-class/35p/specifications.html
The row in your link lists standard memory, not maximum memory.
Pete
Dear Jack:
Fermilab did a Xeon eval in 2003. Here is the burn-in report: http://www-oss.fnal.gov/scs/qualify2003/burnin.html It had 8 days of failures out of 357 days. That's 97.7% uptime, with a 1U space between each 1U dual CPU server. That's not counting a unit that was sent that failed, where the service call never showed up, disqualifying the system. It would have doubled downtime, pushing the uptime down to about 95.5%.
Here is the final report: http://www-oss.fnal.gov/scs/qualify2003/2003rep.doc
Pete
PS: For those who try to compare these numbers to commercially quoted times, Fermilab counts a day as down unless the system was up that entire day. So a 1 minute reboot in a 1440 minute day is 0% uptime by Fermilab, but 99.93% uptime by commercial vendors. Thus add two 9s to the front of these uptimes and move the decimal 2 places to the left: 97.7% becomes 99.977%. The latter number is about what an enterprise 24/7 server actually does.
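To make the rule of thumb concrete, here's the conversion in Python. The two-9s rule works out exactly if each "down day" really hid about 14 minutes (1% of a day) of actual downtime; that per-incident figure is my assumption, not Fermilab's:

    days, down_days = 357, 8
    fermi = 100 * (1 - down_days / days)   # 97.76% by the whole-day rule

    # Vendor-style accounting: only the minutes actually lost count.
    # Assume each down day was really ~14.4 minutes (1% of a day) down:
    vendor = 100 * (1 - down_days * 0.01 / days)
    print(round(fermi, 2), round(vendor, 4))   # 97.76 vs 99.9776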
Dear Wbmw:
Did you notice that that datasheet does not cover any mobile CPUs? None of the MA64s are in it. None of the 90nm mobiles are on it either. No Turions or mobile Semprons. The mobile version isn't publicly published yet.
Also note that empirical tests put the 90nm A64 3500+ under 35W (actual reading 30.7W) using a power virus. So even a desktop A64 2.2GHz 939 can be an ML40. Probably 90% (or more) of them can, and these are D4 parts, not the better lower power Ex parts. Mike's distis, Newegg and Monarch have been selling thousands of them, if not tens of thousands.
Pete
Dear Mmoy:
I personally know of quite a bit of 64 bit ready software that can just be recompiled for XP64. These are ports that may take a few weeks at most due to bugs and workarounds in the new environment (OS, compiler and tools). The applications run on AIX, OSF/1 (Tru64), Solaris, UNIX, MVS or Linux of various flavors and, of course, shoehorned into x86. These are mostly VAR type software, but do include such standards as RDBMS, CAD/CAE/CAM, multimedia, financials, insurance, medical, process, engineering and simulation. Typical consumer type software usually comes from the multimedia and simulation sources, as video manipulation and games.
Soon moving 4GB by email may be as little thought of as moving a 2MB file (a couple of floppies worth) is now, and a 2KB file (a double spaced typewritten page) was 10 years ago. Transitions are fast when the upgrade is both cheap (nearly free) and easy to use. Just look at how fast DVDROM drives with double the speed were adopted, or dual standard ones. The new ones were as cheap as the old ones yet faster and more usable. They didn't take 2-3 years to be adopted. Ditto for 4hr (LP) VHS VCRs and tapes over 2hr (SP) ones (or 6hr (EP) over 4hr).
GPS is not a good analogy because of its initial expense, >$1K in cars. If it were $69.95 installed, it would probably sell like hotcakes. It's not there yet.
Pete
Wbmw:
There is a lot of 64 bit software out there that was either a) originally on 64 bit machines and ported to x86, or b) on x86 machines and then ported to 64 bit machines. Either of these will be quite easy to turn into AMD64 applications.
And you seem to fall into the common misconception that you must have more than 4GB of physical memory to need 64 bit address spaces. This is patently untrue. There are many applications that use a >32 bit virtual space, like operating on a 10GB 2 hour HDTV (.ts) transport stream. This file is loaded into virtual memory and operated on as if you had 10GB of main memory. It's actually demand paged from disk as needed. You can have 3 copies of this stream: original, modified and undo buffer. That's 30GB plus the application program and the OS. That puts you at needing a 36 bit address space even though you may have only 1GB of physical memory.
In case you think this is not typical, just look at IBM and the Power series. That is exactly what they do to speed up file I/O in AIX applications: use the virtual memory hardware to demand page file data in and out (written back only when marked dirty). The CPU, while it waits for the data to be read, can do other work on a different process or thread. Laptop HDs easily exceed 4GB (more than 20 times that).
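The same trick is easy to see from user code. Here's a minimal sketch in Python (the file name is made up; any big file shows the effect): the whole file gets a virtual address range up front and the OS faults pages in from disk only as they are touched.

    import mmap

    # Map a 10GB transport stream into virtual memory. Nothing is read
    # from disk yet; the mapping costs address space, not RAM.
    with open("movie.ts", "rb") as f:
        stream = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # Touch one byte 9GB in: the OS faults in just that one page.
        b = stream[9 * 2**30]
        stream.close()

On a 32 bit build that very mmap fails for lack of address space, no matter how little physical memory the accesses would actually use.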
You expound on the PM's 2MB L2 cache, but hate the idea of using the whole of physical memory as a cache for disk, which, with only a 32 bit address space, Dothan and later Yonah can't exploit for larger applications without a lot of special coding and slow workarounds.
A person with the outstanding credentials of Linus Torvalds stated that for the last few years, they have run into the 4GB limit and had to specially code around it. AMD64 frees them of those artificial constraints. K8's 48 bit virtual address size, or 256TB, can easily cover total HD capacities of 128TB (with the typical division of virtual address space), which is larger than even most large servers have attached, much less any PC or workstation. That's more than $64K of disk alone (200GB HDs exist for about $100, especially in the amounts required), not to mention cabinets, power supplies, controllers and networks to attach it to your super PC.
In short, you need more than 32 bit addressing long before you need 4GB of physical memory, although every little bit helps.
Pete
Dear Chipdesigner:
If he were really serious about SiS making better controllers than AMD, all they would need to do is add one or two to the NB part of their chipset. AMD can use off chip memory controllers with HT; cHT is only needed because they connect to off chip caches in the other CPUs. So SiS can prove their memory controllers are superior. They can even be compared on access time wrt remote 2P memory access times. I think that SiS can't do it better than AMD, else they would do it. You know there would be a lot of MB manufacturers that would use a chipset allowing 8 unbuffered DIMMs on a one socket MB. Getting 10.4GB/s BW, 45-50ns on die and 50-55ns access latency off die (off die 16/16HT at 1GHz is slower than on die 32/32HT at 1.8-2.6GHz) with 8GB of LL PC3200 is better than 6.4GB/s, 45-50ns and 4GB of LL PC3200.
But since SiS doesn't, they likely can't do better and they know it.
Pete
Morrowinder:
You need reading lessons. Xeon is losing revenue marketshare; they may not be losing absolute revenue yet. And AMD didn't lose spots on the top 500 list, they had the same number IIRC. And where are 4P Xeons with EM64T? Not shipping or "released" yet. Also, where are the <68W TDPmax Xeons? The <56W TDPmax Xeons? The <36W TDPmax Xeons? Oops, none on the roadmap for 2005. I see nothing at all new for Xeons for 2005, so I guess an unbiased observer must agree that Xeon will continue to lose revenue market share.
Pete
Dear Morrowinder:
Your facts seem to be mere fog. First, there is no story about the Weather Channel swapping out Opterons. If you saw one, I want to see the link. Second, going from 3-4% server revenue share to 10% is not low growth, that's tripling revenue market share. When did Intel last triple server revenue market share? Tripling from here gets you to 30%. Also, if Opteron is selling poorly at 10% unit share after 18 months, then Itanium, at less than 1% unit market share after 4 years, has to be considered almost flat. And Xeon is losing revenue market share.
As to EM64T sales, you need to add in A64 sales, since many of those EM64T sales are to 1P servers and workstations also served by A64. Thus Intel did not ship 3 times the x86-64 CPUs that AMD did. They haven't even reached parity yet. You cite these paper products that Intel will sell sometime in the future, yet many will not arrive next year. Intel has said they would deliver a lot of things that never arrived. And what they do promise gets slipped back a few months or quarters, and even then products are paper launched with shipments in the distant future. And their roadmaps seem to change every month.
In comparing the current roadmaps of the two, AMD appears to launch more relevant CPUs than Intel.
Pete
Dear Mmoy:
The simple fact is that none of the A64s have a coherent HT link; they have one noncoherent 16/16HT link for IO. An Opteron 1xx has 3 noncoherent 16/16HT links. An Opteron 2xx has 1 coherent and 2 noncoherent 16/16HT links. An Opteron 8xx has 3 coherent 16/16HT links. You need at least 1 coherent link in each CPU to pass cache coherency information back and forth.
Pete
Dear Rlweitz:
There are a lot of crappy little underpowered internet PCs, cell phones, PDAs and set top boxes, many of which still sell well despite being underpowered. Many decently powered ones have larger power requirements. Geode is an under 1W x86 CPU. What has more performance, yet lower power requirements, and is still x86 capable? If and when more performance is needed, upgrades to the Geode NX series or to the under 7W K8 ULP series due in 2006 are the likely paths.
The former brings all-around AXP power that would match most of the current low end laptops, beat all of those other device types above, plus most current game consoles. You know that K8 is beyond most of the PCs currently in use; ULP K8 in 2006 may be beyond all but the top end PCs of today. Does the PIC target market really need mainstream gaming performance?
Pete
Wbmw:
No, he meant 32 bit compatibility mode, which runs 32 bit apps and 64 bit apps simultaneously on a 64 bit OS. And has SGI ported all those hundreds of thousands of esoteric 32 bit apps? No. Not even close. Not even by orders of magnitude. Yet they run just fine on a 64 bit OS running on AMD64 with no recompilation necessary, assuming there is anyone left with the source code (and for much of it, even that possibility is gone). With IA64, every app needs to be ported. With AMD64, only those that need it, like an RDBMS, or that can benefit from it, like en/decryption tools, need to be ported. IA64 requires the 12MB ACME glass cutter control program to be ported; AMD64 can run it as is. When looking at hundreds of thousands of 32 bit programs, 800 64 bit ones seems like a drop in the bucket. And if anything, I'm being conservative.
One vertical app I recently installed had 12 main control programs, 157 assistant programs, 25 reports on a third party report generator, and 66 utility programs, not including the bunch used by the Oracle RDBMS for the main server. Only Oracle's RDBMS and some of its utilities needed to be 64 bit. It runs on Red Hat Enterprise Linux 3.0 ES, 64 bit. The rest are not and don't need to be. If and when they would benefit from being 64 bit, they may move over at the owner's discretion. The vendor refuses to put the server (Windows, AIX and Linux only) on Itanium and only wants Windows (XP or 2000(NT)) for clients (I don't agree with their reasoning). And that is another reason why many apps won't get ported: sheer unwillingness to support.
Pete
Wbmw:
The numbers you posted are irrelevant to the discussion since they measure the "memory" footprint. The working set is usually far less. The working set is just that: the memory needed at any given time, or alternatively the amount required within the total access time of the next (slower) memory level, including load-from (or store-to) time. The virtual memory footprint is the set needed within the program's run time. The set needed within a hard drive access is somewhat smaller; this is known as the physical memory footprint and is likely the figure you quoted. The set needed within the time of a load from physical memory defines the cache working set size.
Each layer of cache has its own working set. Typically, most programs' cache working sets are larger than the L1. SPECfp was made to be bigger than the L2 typically found in computers in 2000. Now there exist a lot of high end CPUs with more than that in the top level of cache. Even some mainstream CPUs have enough L2 for many of the SPEC programs. 300.twolf probably has a small cache working set that fits in most L2s, and it might be one of those serial programs that do not do well with either long pipelines or huge long latency caches. The 46% gain just shows the working set fits entirely in cache, so that clock speed alone (50% higher), all else being equal, accounts for the improvement.
When a program's cache working set fits within cache on the newer part and doesn't on a smaller one, you get the huge boost. Sun's trick allowed a program with a cache working set far larger than their CPU's cache to be reworked until the set was smaller than their cache, and it showed the 1000% boost. If you can rework the algorithm in a program so the working set fits into the previous cache level, with its lower latencies and larger BW (fitting in L1 when before only L2 was large enough), you can get those 2x and up speedups. Like when a program that used to fit in the L2 of a P4 now fits in the L1D and trace (L0I) cache, you could see a 350% boost (x4.5 in score).
It's easiest to see this effect on the Mflops vs matrix size graphs shown with Linpack. When it fits in L1, the Mflops quickly rise to a peak and then drop to a plateau. When L1 is exceeded, the Mflops drop off a cliff and plateau at a much lower level. When it exceeds L2, you get the same effect until you reach the sustained level of main memory. What the document cites is that "size" for a whole program, which by SPEC's goal stays below 256MB. If the Linpack graph were extended beyond main memory, another drop-off would be seen until we reach the sustained rate of hard drives, which is now about a million times the latency and 1-2% of the BW of DDR.
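You can get a crude look at the cliffs yourself with a strided walk over buffers of growing size. A sketch in Python (crude because interpreter overhead and hardware prefetch flatten the curve; Linpack in C shows it far more starkly):

    import time, array

    # Walk buffers of growing size, one cache line (64 bytes) per step.
    # When a buffer outgrows a cache level, time per line jumps.
    for kb in (16, 64, 256, 1024, 4096, 16384):
        n = kb * 1024 // 8                    # buffer size in doubles
        buf = array.array('d', [0.0]) * n
        t = time.perf_counter()
        s = 0.0
        for i in range(0, n, 8):              # stride of 8 doubles = 64B
            s += buf[i]
        dt = time.perf_counter() - t
        print(kb, "KB:", round(dt * 8e9 / n, 2), "ns per line")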
BTW, many novice computer users think they need a new, faster computer and are surprised when they don't see any change with a "twice as fast" box. What happened is that their memory footprint grew larger than their physical memory, and their system started to swap (known as thrashing) to the hard drives. Their new system was purchased with the same or slightly more memory, which, of course, didn't stop the swapping. I put twice the memory into the old box (or remove those code bloating TSRs until it stops swapping) and behold, it runs circles around the new box.
Pete
Dear Kpf:
Chartered says they have 2,000WPM of capacity already paid for. They can get to 7,000WPM for a $400 million investment by a third party. For $700 million, they can get to 9,000WPM, where they will "break even". That $700 million buys 7,000WPM of incremental capacity, and a third party could ask for that additional capacity to be reserved for them when they need it. Considering that the 10.75KWPM Fab 36 costs $2.5 billion, or about $233 million per 1KWPM, getting 1KWPM per $100 million is a great buy. I assume that for $1 billion, a third party would get 9KWPM, and for $1.3 billion, 11KWPM, or a little more than Fab 36. I took the larger 7KWPM figure stated by Chartered for my capacity estimates because it was firm.
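The cost-per-capacity comparison in Python, using the figures above (note the Fab 36 number buys a whole new building, not just increments, so the comparison flatters Chartered a bit):

    # Dollars (millions) per 1,000 wafers-per-month of capacity
    chartered = 700 / 7.0       # $700M buys 7KWPM incremental
    fab36 = 2500 / 10.75        # $2.5B builds 10.75KWPM from scratch
    print(round(chartered), round(fab36))   # ~$100M vs ~$233M per KWPM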
The second half is that Chartered would start these extra wafer outs at 90nm beginning January 2005 on a 90nm SOI process (probably IBM's). If AMD were willing to come up with a 90nm K8 Sempron recipe for IBM's process, they could begin manufacturing K8 Semprons by 2H05, when Fab 30 starts to run out of steam (if the demand is there). Then both IBM at Fishkill and Chartered's Fab 7 could be used to ramp output of low end 90nm K8 Semprons while Fab 30 concentrates on dual cores, servers, high end PCs and notebooks.
Even giving Chartered a 100% markup over cost as incentive, AMD could sell millions at $60-80 ASPs and still get 33% GMs. That's 10-14 million of those K8 Semprons per quarter (20-28% unit share and 10-14% revenue share) for a total of $700 million, with a return of $280 to $420 million a quarter. Thus the investment is nearly paid off by 2H06, and AMD still gets 2.4x of current Fab 30's capacity. That allows Fab 30 to make 5-7 million $150-250 ASP CPUs at 70% GMs, or $0.7-1.0 billion a quarter in pretax earnings.
The kicker is that 65nm would still ramp in 2006, and Intel would be faced with a competitor having 30-40% of the unit and 25-35% of the revenue market share. If demand ramps more slowly, they let Chartered sell those wafers to its other customers with either no or a small return, and use that reserved capacity for the 65nm ramp. AMD wins either way. But the first is much bigger and more profitable to us and better for all the OEMs, because it takes out Intel's "we won't supply you" argument by end of 2005 instead of end of 2006.
Pete
Wbmw:
Current capacity of Fab 30 at 130nm is 5500WPW of 200mm wafers, or 71,500 wafers a quarter times 31,416mm2 per wafer, or 2.246x10^9 mm2 per quarter. Divide that by (130x10^-6 mm)^2 and you get 132x10^15 feature-sized squares. Fab 36 has 2500WPW of 300mm wafers at 65nm. Plugging in those numbers, 2500WPW x 13 weeks x 70,686mm2 per wafer / (65x10^-6 mm)^2 equals 543x10^15. That's 4.09 times Fab 30 at 130nm. So Fab 36 is 4 times Fab 30 measured in square units of the process dimension. That's for straight shrinks.
Yes, to get best performance rather than unit output, you must resize the relative dimensions of transistors and the ratios of logic vs transmission line area vs decoupling capacitance area. Thus a 90nm K8 Sempron is not half as large as a 130nm K8 Sempron; it's 58% as large (69mm2 vs 121mm2). Similarly, a 65nm K8 Sempron is about 60% as large as a 90nm one, or 41mm2. That means about 3 65nm K8 Semprons use the same area as one 130nm part. The additional unit output comes from the smaller wastage percentage on 300mm wafers, and the smaller die size loses less area to partial dies around the edge. Thus you get about 200 full 121mm2 dies on a 200mm wafer (depending on the exact die size and how near to square it is) and about 675 full 41mm2 dies on a 200mm wafer. You get about 500 121mm2 dies on a 300mm wafer and 1600 41mm2 dies on a 300mm wafer. That's about 3.6 times the dies in real world terms. So Fab 30 at 90nm and Fab 36 at 65nm together will have 5.6 times the capacity of Fab 30 at 130nm.
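Those die counts line up with the standard first-order gross-die-per-wafer estimate. A sketch in Python (the edge-loss term is the usual rough approximation, my formulation, not AMD's yield model):

    import math

    def gross_die(wafer_mm, die_mm2):
        # Usable wafer area over die area, minus dies lost as partials
        # along the circumference.
        r = wafer_mm / 2.0
        area = math.pi * r * r
        edge = math.pi * wafer_mm * math.sqrt(die_mm2)
        return int((area - edge) / die_mm2)

    for wafer in (200, 300):
        for die in (121, 41):
            print(wafer, "mm /", die, "mm2:", gross_die(wafer, die))
    # -> roughly 202, 668, 498, 1576: close to the 200/675/500/1600 above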
Chartered's Fab 7 would add 7000WPM of capacity for about $700 million. Assuming AMD took all of that capacity (and paid for it), Fab 7 would have about 65% of Fab 36's capacity, for a total of 8 times Fab 30's 130nm capacity in real world terms. Given that Fab 30 can produce 8 million CPUs per quarter at 130nm, that means by the end of 2006, AMD could produce 64 million CPUs of roughly the same mix. That is 125% of the expected demand. With Fab 30 and 36 alone, they could produce about 45 million, or 88% of the expected demand. Of course, the mix may shift towards smaller, lower cache CPUs, and they could make 51 million of those with just Fab 30 and 36, or 100% of expected demand.
All in all, adding in Fab 7's incremental output allows CPU transistor counts to climb with AMD still able to supply world demand for GP CPUs. Intel could no longer threaten to cut off any customer without having them defect to AMD or another supplier (like IBM or Via). In 2005, it lets AMD increase share as Fab 30 goes 90nm, with the option of using Chartered's 90nm process to fab low end Semprons to boost capacity during Fab 36's build and install time, if needed (demand over 12 million K8s a quarter). During 2006, it allows a faster ramp of 65nm for the push to 33% (and up) market share.
Pete
Dear Petz:
FP adds need a 64 bit barrel shifter to align the mantissas to the same exponent. Thus while K8 has 3 64 bit adders, it has only one 64 bit barrel shifter. Adding a second barrel shifter is not that much real estate, but would allow a second FP add unit. FP multipliers use a 64 bit integer multiplier. There is only one in K8. Adding a second one would allow its use for both integer and FP multiplies. Again, it is not much in area, but it would double FP and integer multiplies per clock. These would add maybe 200-400K transistors to the die, not much compared to the 100 million already there. Doubling L1D and L2 port width to 128 bits may add a few K more transistors, and another mm2 of die for the wider busses would help further.
They may add 1W to the TDPmax, but it would be worth it for increasing SPECfp by 40-60% (1700->2400-2700 at 2.6G) and SPECint by 10-30% (1700->1870-2210 at 2.6G) at the same clock. Then Opteron would be at the top of the heap in SPECint at 2.4-2.6GHz and in SPECfp at 2.8GHz. It would also blow past everyone in Linpack and synthetic tests.
But IMHO, neither is in stepping E0.
Pete
Dear Keith:
They can't get 2 846HEs for $700 each. Or even 840EEs. Look it up. And those are in high demand by all indications, especially since you get 4 way 8xx type performance from dual core 2xxs, or 8 way 8xx performance from 4 way 8xx duals. The reduction of MB and chassis costs for twice the ways can easily make up $1K or more of premium by itself. Besides, if the die costs are 3 times higher, yield and capacity wise, why shouldn't AMD get all of that back in ASP for such a premium product? It makes far more business sense than your theory does. Intel has no trouble doing that. They get over $3K for top end Xeons and $5K for top end Itaniums, last I heard.
To apply your theory to Intel, Xeons shouldn't get more than what Opterons get for that amount of performance, $200 or so, $700 tops. Sorry, it doesn't work that way. We shall see when pricing does come out, but IMHO, the pricing ratio seen between a premium A64 FX (say the FX-55) and an A64 with 70% of its performance (say the A64 3000+) will hold between top end single cores and duals 3 grades down (160% performance), while the duals five grades down will be priced lower but still well above top end singles (130% performance, as the FX-55 is wrt the A64 3200+). You see the premium AMD charges for 30% more performance? For 60% more? I thought most would finally see the light.
Pete
Dear Keith:
Did you forget the 846HE price of $1,165? That is two speed grades down from the $1,514 850. The dual core would likely be equivalent to two of those plus some extra for a density premium. 2 * $1,165 + 2 * $418 (the 2 way to 4 way premium per core; look it up between a 246 and an 846) = $3,166. Similarly, from the 4 grades down 840EE price of $1,165: 2 * $1,165 + 2 * $500 = $3,330. Since I think AMD could release an 852 Opteron right now at a premium to the current x50s, those would be the 3 grades and 5 grades down versions stated by AMD as possibilities for duals. And the top dual may be faster than three grades down from the top single.
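The arithmetic in Python, for anyone who wants to play with the premiums (the $418 and $500 premiums are read off current 2 way vs 4 way pricing as described above; the rest is my arithmetic):

    base_846he = 1165     # 2 grades down from the 850
    base_840ee = 1165     # 4 grades down
    for base, premium in ((base_846he, 418), (base_840ee, 500)):
        print(2 * base + 2 * premium)   # -> 3166 and 3330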
Pete
Dear KeithDust2000:
When top end Xeons are currently selling for over $3K, why wouldn't AMD charge $3K for top end Opterons? Since Opterons already perform nearly twice as well as top end Xeons, there is some justification for AMD to charge even more than $3K. Opteron 850s get $1,500 now. A dual core Opteron 2xx can easily justify twice that, since 2 of them make for four cores on a cheaper 2 way platform versus 4 singles on a more expensive 4 way platform.
That difference would allow dual core 2xx Opterons to fetch $3.5K or so and still make 2 socket servers cheaper than 4 socket single core 8xx Opteron servers. Similarly, dual core 8xx Opterons in 4 socket servers could fetch $5K each and still be cheaper than 8 socket servers of single core 8xx Opterons fetching $1,500 now. And 8 socket dual core 8xx Opteron servers could fetch $7.5K per CPU and still be cheaper than 16 socket Horus based servers filled with single core 8xx Opterons at $1,500 each.
Add in the density increase premium, and it's easy to justify more than $3K ASPs. In the past, AMD might have priced them below this, but with the performance crown and no infrastructure changes needed, the current AMD will charge whatever they can get. This will reduce demand in the beginning, but increase revenues and, most importantly, profits while AMD has your dreaded capacity problems. It also gives time for OEMs to validate and boost offerings, as well as time for AMD to work out any bugs that may crop up. When Fab 36 begins production, those constraints fall away and AMD can drop prices to gain share in all the various markets.
You would have thought that AMD could never, ever, get $1K for a CPU five years ago, but now they have many over that price. Times are changing.
Pete
Dear Bobs10:
Another possible mess-up will be the classification of PICs. If Xbox was allowed to boost Intel's unit market share, then x86 based Geodes should do the same for AMD. 10 million PICs a year could throw a monkey wrench into the market share figures even though it would not have a material effect on the revenue market share numbers. Heck, that could be the whole of the unit market share increase you talk about: 2.5 million more PICs would boost AMD from 17% to 21% all by itself. I think that AMD, with a solid amount of K8 Semprons, will go beyond 10 million a quarter average in 2005, plus any Geode sales. That makes for about 12.5 million in a 50 million market, or 25% unit share. Doubling (or more) PICs in 2006 could boost that share to over 30%, AMD's target for end 2006 without Fab 36. With it, well, we shall see, HBABG.
Pete
Dear Golfbum:
You forget that Fab 36 is a 65nm 300mm fab. At 65nm, the die sizes shrink further: 65-69mm2 becomes 36-39mm2, 84mm2 becomes 46mm2, 114mm2 becomes 63mm2 and 203mm2 becomes 112mm2. Now place those on a 300mm wafer and you get about 1500, 1150, 840 and 470 GDPW respectively. At the 2500WPW planned at full production IIRC, that's enough for 49 million K8 Semprons, 37 million 512KB K8s (successor to Winchester), 27 million Opterons or 15 million dual core Opterons per quarter, or some mix thereof. If mostly K8 Semprons, Fab 36 has enough capacity by itself to supply the entire market. Fab 30 could make another 24 million K8 Semprons to allow more of the other types in Fab 36. Heck, you can cover 100% of the market with Fab 30 making only K8 Semprons and Fab 36 making just A64s and Opterons, with over 10% being dual core.
And we have seen through benchmarks that large caches are less meaningful to A64s, even in games, so 256KB K8s are still solid gamer CPUs compared to 1MB and 2MB Prescotts and 2MB Dothans.
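Checking those volumes in Python (GDPW figures from above; I'm assuming every gross die is good, so treat these as ceilings):

    wafers_per_quarter = 2500 * 13          # 2500WPW x 13 weeks
    gdpw = {"Sempron 36-39mm2": 1500, "512KB K8 46mm2": 1150,
            "Opteron 63mm2": 840, "dual core 112mm2": 470}
    for part, dies in gdpw.items():
        m = wafers_per_quarter * dies / 1e6
        print(part, "->", round(m, 1), "M/quarter")
    # -> 48.8, 37.4, 27.3 and 15.3 million, matching the text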
Pete
Dear DDB:
The L1D cache is 5 ported. You can get 5 64 bit values per clock, 3 integer and 2 floating point. That's why there are 3 integer AGUs. So you are correct, if only dealing with floating point (or SSE(2,3) in the current design). That may be why AMD's caches take more area per byte than Intel's. Also, the L2 is 16 way to Intel's 8 way, with 64 byte cache lines to Intel's 128 byte ones. That puts 4 times the tag overhead on AMD's approach versus Intel's, but that may be more of the work smarter, not harder design goal that has worked so well for AMD lately.
Pete
Dear DDB:
Don't forget that a large part will be 90nm K8 128KB/256KB Semprons at 65-69mm2 die sizes. They will take over from Tbreds at 85mm2 and Bartons at 101mm2. 84mm2 Winchesters take over from 150mm2 Newcastles, and 114mm2 90nm Opterons from the current 194mm2 Sledgehammers. The volume of dual cores will be less than 100K at about 203mm2. Since 130nm production has been as high as 8 million CPUs a quarter, 90nm in the same mix will be over 12 million CPUs per quarter, and that is about a 30% unit marketshare. Revenue market share will likely be higher, as Winchester ASPs are higher than either average Newcastles or any Barton. Yes, the same model numbers are priced the same, but the 90nm parts hit those model numbers at lower clocks than most Newcastles.
And if the production of dual cores dampens unit market share, the probable pricing will more than make up for it. 100K at $3K each is a cool $300 million, which is more than the 300K single core Opterons ($500 ASP) or 500K Winchesters ($200 ASP) the same wafers would otherwise yield. If we get more dualies than that, we should be so lucky.
Pete
Dear Bobs10:
The whole x86 server market is about 4 million CPUs per quarter, mostly DP Xeons and SP P4s. Intel derives about $2 billion from these, so 10%, or 400K CPUs, is $200 million each quarter. With 90nm, AMD can make 8 million Opterons per quarter. That is sufficient to let the server market double in size and still supply all of it.
Assume Opteron took 100% of this market. AMD's total unit marketshare might only be 22-30%, but at far higher ASPs and revenue market share. 4 million Opterons would sell for $2 billion, or $500 each, and if the other 6-8 million sell at $100 (conservative), then AMD has CPU revenue of $2.6-2.8 billion, or about 36-39% revenue share (overall ASP of $220-240). Given the current 75%/25% fixed/variable cost ratio, AMD would have gross profits of $1.5 billion and $1 billion after tax each quarter, or about $2.20 EPS. And that is with no flash included.
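The revenue arithmetic in Python, taking the top of the 6-8 million range for the non-server parts (my choice of endpoint; the ASPs are the ones above):

    servers = 4e6 * 500       # 4M Opterons at $500 ASP
    rest = 8e6 * 100          # 8M other CPUs at $100 ASP
    revenue = servers + rest
    asp = revenue / (4e6 + 8e6)
    print(revenue / 1e9, "B$/quarter, ASP", round(asp))  # 2.8B, ~$233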
As for dual core, if priced appropriately, say $3K for the top bin, losing 1.5 million units of production to make 500K duals is an ASP and revenue gainer, adding $1 billion in revenue and moving ASPs to $360. That would fall directly to the bottom line and add $1.50 to EPS. AMD would be making $1.7 billion in cash each quarter. We should be so lucky.
Face it! AMD does have the capacity to supply 100% of the server market and reap 80% of all CPU profits. Intel couldn't handle a $3+ billion haircut on revenues; they would lose money (until they stripped away some money losing divisions and shaved head count (especially those with large stock options)).
But so much for "pie in the sky" scenarios. The truth is that AMD can supply all of the high end of the CPU market even at 130nm. They can do more at 90nm. 20-30% revenue share is doable at 90nm just using Fab 30. With Fab 36, 80-90% revenue share becomes feasible, and with 256KB 64 bit K8 Semprons, they could even supply 100% of the CPU unit market.
As to a price war, Intel would need to drop its ASPs to under $75 to drag AMD into the red. At that point, Intel is bleeding to the tune of $3 billion a quarter in cash (remember to add in the cost of the stock buyback program needed, else the share price would fall to below book of $5 a share). I don't think the shareholders would stand for that.
Pete
Dear Dacaw:
I just took a 100min movie on a DVD and recoded it to Xvid format, a 1.1GB AVI, in 65min using mencoder under Linux (512MB, AXP 2400+). Burning takes me longer, as I use cheap 2x DVD-R media (34min for 4488MB with fixation, etc. using cdrecord-proDVD). But I do have a ti4200p GPU with on board IDCT and motion compensation acceleration. Still, 75min end to end for a 100min movie is 33% faster than real time. An $80 CPU with $50 memory, a $40 MB and a $100 video card, for a total of $270, beats needing a $500 A64 3700+ with $100 memory, a $100 MB and some cheap video card ($700 total) to do it via software only. Add in $200 for a HW encoding pcHDTV HD-3000 board and I am still cheaper for OTA/cable DVR. Add in a $90 200GB drive and it's still cheaper ($560 vs $700+).
The wonders of high performance, low cost, specialized hardware.
Pete
Dear Dacaw:
That is for terrestrial broadcast. Cable and satellite may use 38.4Mb/s. DV from camcorders and semi-pro decks can be either 25Mb/s or 50Mb/s, and that's for standard CCIR-601 video (720x480x60i (or x30p)). HD camcorders (1280x720x60p or 1920x1080x60i) use DV's 50Mb/s or 100Mb/s, typically in 4:2:2 using 10 bit samples. MPEG2 defines 4:2:0 using 8 bit samples only in the Main profile (although there is nothing stopping it from doing 16Kx16Kx60p except data rate and processor power limitations) and 4:2:2 in the Studio profile.
If you look at http://www.yolinux.com , you can see that typical HDTV DVR applications for Linux use the IDCT of nVidia GeForce4 (or later) cards to drop CPU power requirements from 2.4GHz (P4) to 600-800MHz (P3). The later ti4xxx products also help out with motion compensation. Hauppauge actually has HW MPEG2 de/encoders in its $200 WinTV PVR-350 cards.
Pete
Dear Dacaw:
I think you had better check your data rates for various video feeds. 1920x1080i is picked at 30fps. Full data rate needs 12MHz 16 point trellis VSB. That's 4 bits at 24M samples/s, or 96Mb/s or 12MB/s, not 12Mb/s. Of course, that is already compressed for broadcast. Output at 1280x768x24x60fps, the best for heavy action, is a raw rate of 168.75MB/s. At a good MJPEG compression rate of 4x, you need 42MB/s for the video stream. That is at the upper rate for most HDs used at home. In addition, it means 300GB for a 2hr movie.
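Those rates are easy to verify; straight arithmetic in Python from the frame geometry above:

    # Raw and MJPEG-compressed HDTV data rates
    w, h, bits, fps = 1280, 768, 24, 60
    raw = w * h * bits // 8 * fps                    # bytes per second
    print(round(raw / 2**20, 2), "MB/s raw")         # 168.75 MB/s
    print(round(raw / 4 / 2**20, 2), "MB/s at 4x")   # ~42.19 MB/s
    print(round(raw / 4 * 7200 / 2**30), "GB/2hr")   # ~297 GB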
To really do the processing you want, you will need some sort of RAID arrangement as well, since 300GB drives are at a premium. Linux can do software RAID, so you need multiple large drives. This is where Windows falls down: it doesn't do well with huge files spread over multiple disks. Linux runs in 64 bit mode and easily handles files of TB+ sizes on filesystems hundreds of TB in size. Granted, not many personal computers even have a TB of disk.
But if you want to edit and process HDTV movies, you need to invest in disks first. Get one of those MBs with a RAID controller and either 4 IDE channels or 8 SATA links (or both). I saw a Gigabyte board like that. Put it in one of those full tower cases with 12 bays or so. My dad picked up an old one for $5 (10 5.25" HH bays and 5 3.5" bays in a cage below, with two more on the top rail) and put in a $99 Enermax 465 (a good deal). Those old cases get passed over a lot, so you can get good deals, as they are too large to fit under a desk or table. If you need more 3.5" bays, there is a cage holding 5 3.5" bays that fits in the space of 3 5.25" HH bays. So my dad could put 17 3.5" drives and still get 4 DVD burners into his tower. That would be a 4TB RAID 5 filesystem for video data and 2 mirrored 300GB drives for root/swap/home files, all in that one tower. He currently uses SCSI with 12 18GB 10K drives, 2 groups of 5 in software RAID 5 (72GB each) and 2 mirrored (18GB), for 162GB total, using an AXP 2400+ with 1GB RAM and two Pioneer DVD burners. He finds it fast enough to do video editing.
Only then do you go after CPU horsepower.
Pete
Dear Dacaw:
Look at hardware that pulls in an HDTV signal and decodes it to the screen, for example. These are done via a video processor like the Convergent, Zoran or Bttv VPUs. Look at this page for a pcHDTV card: http://www.pchdtv.com/hd_3000.html It uses an nVidia card for accelerating motion compensation (the big user of computing power in MPEG2/4 encoding).
Excerpt:
Cost effective ATSC/NTSC TV reception card
Open source drivers and player
All-software HDTV decoder
Supports all 18 ATSC compliant digital formats
Supports NTSC Analog Television
Upto 4 cards supported in a single system for recording and display of multiple programs.
Compatible with the HD-2000 card.
Digital Video Recording, Time Shifting and Playback
Accelerated HDTV support with nVidia video cards.
Accelerated IDCT and Motion Compensation with GeForce4 Mx cards
Accelerated Motion Compensation with GeForce4 TI cards
Selectable Weave or One Field de-interlacing for interlaced formats
Command line support for station scanning
Command line support for station signal strength
Command line support for recording.
Notice it uses a GPU to do motion compensation. IDCT encoding and decoding is done in the GPU as well. In my current computer, an Iomega Buz video capture card uses a Zoran VPU to encode MJPEG (Motion JPEG), video composed of a stream of JPEG compressed frames. Then all that needs to be done for MPEG2 or MPEG4 is to compute the motion compensation information needed to generate the predicted P and interpolated B frames (out of six frames of the stream, one is the I base frame (essentially a JPEG compressed frame), one is a P (predicted) frame and 4 are B (interpolated) frames). Thus motion compensation enhances the compression by 3-5x over MJPEG. You tell the Zoran how much data to save for each JPEG frame (10KB-200KB), which sets the compression used (data rate of 300KB/s (20x comp) to 6MB/s (2x comp) for NTSC (broadcast uses about 11MB/s uncompressed)), and it does the JPEG compression itself. Audio compression is done by your audio card (mine is the classic SB Live Platinum).
MJPEG is quite large, 20GB for a 2 hour movie, but it is frame by frame editable, and many video effects can be done via Gimp or other such tools. When I had a K6/3 at 400MHz, it did all the motion compensation when encoding to MPEG2. It took 12 hours to encode a 2 hour movie (MPEG1 took longer, about 48 hours to put it on a CD) into a 4GB file. Now my Athlon XP 2400+ (2GHz) does a MPEG2 to MPEG4 (Xvid) conversion in 1.8 hours for a 2 hour movie using mencoder (I have an nVidia GeForce4 ti4200p card, but mencoder does not use it).
Notice that with a GeForce4 or better card, HDTV, typically 1280x768 at 60fps progressive scan, can be done in real time with a 2GHz AXP. There is no need for a 3.2GHz dual core. If you want 3D virtual reality at 2Kx1.5K, 32 bit color, 75fps, times 2 (eyes), which by necessity generates 1.8GB/s of data in real time, you may need such a beast as you're talking about (current cards like the 6800 or X800 XT do tricks to get 1600x1200x32 for games at 60-80fps). The physics and realistic modeling are beyond most current GPUs to render and GP CPUs to calculate the graphics vertex lists needed.
True "you are there" 3D immersion requires 10Kx7.5Kx48x75 imaging per eye (900MB per buffer), some 60 times the capability of current GPUs to deliver (for the 100K triangle scenes done with current GP CPUs; movies use about 10 million triangles to do a decent job, multiplying both CPU and GPU needs by another 100 times). Most people conveniently ignore the artifacts generated by today's limited equipment (the typical suspension of disbelief necessary to function in today's games) unless they are incredibly glaring. The AI required boosts this manyfold (like that shown in movies like The Matrix or Virtuosity). We are at least 10 years from this level of power just to do the video portion, and another 10 years from decent enough AI.
Pete
Dear Dacaw:
Video recoding will most likely be done by a special add-in IC on most video boards, if not put into the GPU. Here a $10-20 specialized part will outrun most GP CPUs. The effects will be handled by the GPU, the audio by the audio DSP, and the CPU will coordinate all of these as well as read in the source video and write out the recoded video stream. It will also burn the Blu-ray discs. Most professional video workstations already do this. The GP CPU may do "what if..." short encodings to get a feel for what should/could be done at various points, but the full render will be by specialized hardware.
Pete
Dear Mas:
How does an "on board" memory controller be read by you to be an "on die" memory controller? The two terms have totally different meanings. The article you read it from was referring to chipsets. They are definitely "on (the mother)board". Wishful thinking on your part, I think.
Pete
PS. That is not to say that, after a management shakeup, they won't reverse engineer cHT and place a DDR3 or FB-DIMM MCT/DCT on die. Perhaps even make it socket 940/939 compatible. NIH goes by the wayside when one is desperate.
Chipguy:
Even Intel claimed that assembly level compatibility was a key point going from 8080 assembly code to the 8086 assembler. Yes, there were lots of new instructions, many improvements and, yes, some rather bad decisions wrt segmented addressing. All in all, the 8086 (20/16) was an improvement on the 8080 (16/8). But it also competed with the similar superset Z80 (16/8) and against two good CISC CPUs, the Motorola 68K (24/16) (address/data bus sizes) and the National Semiconductor 16032 (32/16). The NS16032 was really orthogonal, a feature copied a lot in future GP RISC CPUs.
Yes, the 8086 grandfathered most of the 8080 to the extent that straight compilations in tiny model (code=data=stack=heap=64K) generally worked with little or no change. This is what Intel claimed in their promotional literature at the time. Do you disbelieve Intel's own documents and statements?
As to the DMA not being a part of the CPU, that is an implementation detail you dismiss. What, Intel couldn't add it to their chipsets for the EM64T launch? You claim the 36 bit physical address limit is a CPU external problem? No, the AMD documents specify a 40 bit minimum physical address size (later CPUs will have even larger physical address buses) and that DMA is 64 bits, but as usual, the kludge could do neither. The large errata set shows either how little Intel could fix or how huge the original list must have been. If that doesn't point to a kludge, you must need to see change wires draped over the package, patches, repairs and multiple overlapping "change order" post-it notes before you believe anything is a kludge.
Pete
Chipguy:
The 4004 had a kissing cousin, the 8008 (the 4004 enlarged to 8 bits). The 4040, the 4 bit successor to the 4004, was also enlarged to 8 bits, making the 8080. The 8086 took the 8080's assembly level ISA, added a few instructions, some of which were copied from the Z80, and added that much detested segmented addressing. The 8087 was a redeeming feature, though. An 8 bit data bus version was quickly added, and that is what IBM used (IMO they should have gone directly to 16 bit data buses).
The 80186 added some hardware portions from the chipsets of the time, but was largely ignored. The 80286 tried to boost segmenting to 16MB and added virtual memory controls, but never was a big boost over the 8086. The 80386 doubled the width of all the integer registers and allowed flat addressing to return, to a sigh of relief from programmers everywhere, and almost all OSes had to allow flat addressing or else be left behind. Sure, it could run 16 bit code, but it ran 32 bit code and used a 32 bit data bus.
The 486 added pipelining, but it was the last true CISC processor. It also started the ISA extensions that became mainstream. The Pentium added a 64 bit data bus and started the x86-decode-to-RISC-backend approach seen to this day. The MMX extension set was added during the Pentium reign. The Pentium Pro, and later the Pentium II, added OOE to the x86 bag of tricks.
The Pentium III added the SSE extension group after AMD led the way with 3DNow, an extension well regarded by programmers. AMD finally split with Intel here with the Athlon (K7). The P4 radically altered the backend by using a trace cache and an overlong pipeline. It also added the SSE2 extension group; NW later offered HT, and Prescott added the SSE3 extension group. During this time, Intel tried to push a failed ISA type, VLIW modified as EPIC, as their great new proprietary futuristic 64 bit ISA.
AMD stuck to the tried and true upgrade path with the AMD64 Opteron (K8). This added new modes of operation, got rid of the segmented legacy in 64 bit mode (it's still used in 32 bit modes), doubled the size of the registers and, for the first time, doubled the number of them. It also got rid of some long outdated legacy and duplicate instructions, added the SSE2 extension group, and changed the interface between the CPU and all other CPUs, memory and chipsets. Yes, it can run in all 32 bit modes, but it does best in 64 bit mode. It leaves no one behind. It is not simply extensions but the next step for the x86 ISA. And it is well balanced: it does all software well, rather than well on some, badly on others and outrageously poorly at times.
EM64T is a kludge. No 40 bit addressing (only 36 bit), lots of errata and some glaring holes (a missing 64 bit IO MMU).
Pete
Dear Grimes:
A rough guide to volumes, given the typical demand surge for a "hot" product: 1-1.5 million K8s in Q3, 2-3 million in Q4, 4-6 million in Q1, 6-10 million in Q2 and 9-12 million in Q3. By high end Q2 to low end Q3, AMD will need to make 10 million K8s a quarter. So either they supplement Fab 30 with an outside foundry making the cheap K8 Semprons (the lower clocks let the slower foundry process (built for yield, not speed) produce saleable clock bins) or they bring up Fab 36 faster to make more CPUs. Most, including myself, figure the former, but they would be happier if the latter were truly ready.
Pete
Dear J3pflynn:
Given Fab 30 and Fab 36, about 1-1.5 years plus conversion. Capacity can be added incrementally on an as-needed basis, but it is usually needed fast in this business, often before you're ready.
Pete
PS, they will be ready by Q2-3 of 2005 for Fab 36.
Dear Tecate:
My point was that Intel was neither first nor best at 64 bit. AMD has sold more 64 bit CPUs than Intel. And that shows just how poor Intel is at 64 bit. They even had to follow AMD's version of it. Pretty lousy for a company 6-7 times as big.
Pete
Dear I_banker:
Yes, 130nm SOI Semprons, AKA K8 Semprons, although much of this will transition to 90nm SOI Semprons when that design is in production at 64-69mm2. With "world class" yields, 450 good 64mm2 dies could be made on a 200mm wafer. That's more money than 250 130nm bulk 100mm2 K7 Tbred Semprons (or 200 120mm2 Barton Semprons).
Especially if many of these work (1.6GHz-2GHz (2800+-3300+)) with a TDPmax of 25W. Many of these could be used in fanless HS systems and laptops. A decent amount will likely be able to go into 15W TDPmax form factors in the 1.2-1.6GHz (2400+-2800+) range of 754s. Definitely HS-only, and possibly even no-HS 2-3lb subnotebooks (or wearable systems).
Pete
Dear Keith:
I notice not a single 130nm bulk window after H2-04. They can probably supply any demand after Q1 from inventory (with clear indications as to when it will run out, so buy before they are gone). Others will supply what little demand is left from their inventory (this is when prices rise significantly). Semprons will be all K8 based after that, either 130nm SOI or, more likely, 90nm SOI for its small die size, and thus both cheap and high output per wafer (300-400 depending on defect rate). 2K wafers per week would make 8-10 million K8 Semprons a quarter. The other 2-3K wafers per week would go to make the 6-8 million A64s and 1 million Opterons a quarter.
Pete
Dear Tecate:
Intel was not the first with a 64 bit chip. That was IBM, IIRC. The first well known one was the Alpha, IIRC. They didn't even make the first 64 bit x86 chip; that was AMD. They were first with the 8086, but many of us would have preferred IBM use either the Motorola 68K or the National NS16032, as either was more powerful, used flat addressing and was easier to take to 32 bits (68020 or 32032). As it was, the 68K went into many workstations and small servers (and Apple Lisas and Macs), and the NSx32 went into obscurity.
Pete
Dear Keith:
AMD also stated that there will be no more K7 production after Q1-05, and to stop producing by Q1-05, you must halt wafer starts by the end of Q4-04. Thus 50% of wafer starts being 90nm K8 means 50% of all wafers for CPU production (less some wafers for new designs and engineering). And since more CPUs come from a 90nm wafer than a 130nm one, more than 67% of production at the end of Q1 will be 90nm. 65nm Fab 36 starts sampling by Q2 for Q4-05/Q1-06 production. 100% 90nm means about 12-15 million K8s per quarter. That's enough for 25%-30% unit marketshare and probably higher revenue share. Fab 36 could add another 24-36 million K8s and K9s (assuming near constant CPU sizes). That gives AMD the capacity to garner 75% to 100% of CPU unit marketshare. That's when Intel has to really worry.
Pete
PS, I figure that when Fab 36 starts to really produce, they will either convert Fab 30 to 65/45nm 300mm or to making flash. If the latter, they will start on a new fab by the end of 2005 for 45/32nm 300mm production, in a different country.