It has plenty of relevance in that I know enough about this field to distinguish
between what is believable and what is BS.
Based on your posts I have seen I beg to differ.
So pardon me if I believe that Ditzel & Co. have a tad more credibility than
some anonymous poster on an Intel message board.
The same Dave Ditzel famously quoted as saying that out of order RISC
processors were more complicated than out of order x86 processors to
justify his failure to bring Sun uPs up to the level of its competitors?
I wonder if TMTA's development direction for 8000 is another example
of Ditzel's professional failings/blind spot in action? :-P
The difference is I have no financial stake in the success or failure of TMTA.
But Ditzel & Co. are in a tough spot and their immediate future is tied to
continuing to successfully sell this (IMO) technological snake oil. The question
for you to consider is whether going to the 8000 will give them the combination
of performance, power, and cost to keep them in business for another product
development cycle given that Intel has their niche firmly locked in its gunsights
with Banias and its successors.
wbmw: Your condescension is unnecessary. As I have developed compiler back ends
which included code optimizations similar to what is done in TMTA’s CMS, I think I am
qualified to speak with a modicum of authority on these matters.
No kidding? I wouldn't have guessed it. How recent was it and what was the target
architecture? If it was recent work and the target was a VLIW or a wide issue in-order
superscalar then perhaps you could explain how the work of Mahlke and Hwu
influenced your optimization strategy? If not then what relevance does it have to the
discussion of the potential architectural speed-up of the 8000 over the 5x00 series?
TMTA will realize a greater improvement than in the typical hardware-only product cycle
because the CMS can handle a good deal more ILP than what has been achieved with
Crusoe’s 128-bit VLIW.
You really believe that? You should examine how much ILP IPF compilers are finding
in raw program code source. They still have lots of room for improvement even for a
6 way VLIW.
Despite the cute name CMS is just a JIT compiler. A compiler whose source code
is x86 object code and output is VLIW native code. The x86 source code already has
a lot of the ILP squeezed out of it compared to raw program source code due to
the serializing nature of few GPRs, a condition code based computational model,
and a lack of any aliasing information regarding program variables. At the same time
because CMS is in operation on the fly interleaved with program execution it has to be
very parsimonious in the types of profile information it can capture and examine and
the types of computationally intensive code optimizations it can employ compared to
a state of the art compiler for IPF or superscalar RISC.
In summary, getting a decent architectural performance improvement from going from a
4 way VLIW to an 8 way VLIW is a problem of first order difficulty even when your compiler
has full access to program source code information, extensive global run time profiling
information, and virtually unlimited compile time. But in the case of TMTA, CMS lacks all
three and as a result a really difficult problem is made much worse. So excuse my
extreme scepticism about your claim.
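The point about few GPRs serializing the code can be illustrated with a toy model. This is only a sketch under stated assumptions: unit latency for every instruction, a made-up four-instruction sequence, and a hypothetical `critical_path` helper. It says nothing about how CMS itself works. It compares the length of the longest dependence chain when only true (read-after-write) dependences count against the case where reuse of a single scratch register also forces ordering.

```python
def critical_path(instrs, rename):
    """Longest dependence chain, in instructions, assuming unit latency.

    instrs: list of (dest, [srcs]) tuples in program order.
    rename: if True only true (read-after-write) dependences are honoured;
            if False, write-after-write and write-after-read reuse of a
            register also forces ordering (a deliberately strict model).
    """
    last_write = {}  # register -> depth of the instruction that last wrote it
    last_read = {}   # register -> deepest reader since that last write
    longest = 0
    for dest, srcs in instrs:
        # True dependences: wait for the producers of every source.
        ready = max((last_write.get(s, 0) for s in srcs), default=0)
        if not rename:
            # False dependences: wait for earlier writer and readers of dest.
            ready = max(ready, last_write.get(dest, 0), last_read.get(dest, 0))
        depth = ready + 1
        for s in srcs:
            last_read[s] = max(last_read.get(s, 0), depth)
        last_write[dest] = depth
        last_read[dest] = 0  # a fresh value has no readers yet
        longest = max(longest, depth)
    return longest


# Hypothetical sequence: two independent computations sharing one scratch register.
SHARED_SCRATCH = [
    ("eax", ["a"]),
    ("x", ["eax"]),
    ("eax", ["b"]),
    ("y", ["eax"]),
]

with_renaming = critical_path(SHARED_SCRATCH, rename=True)
without_renaming = critical_path(SHARED_SCRATCH, rename=False)
```

In this made-up case the four instructions form a chain of two when the register reuse is ignored and a chain of four when it is honoured, so the available parallelism halves. How much of that a run-time translator can recover is outside the scope of the sketch.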
Behavioural psychologists have known for quite a while that pink ambient color has
a calming influence on aggressive or belligerent people. Sometimes drunk tanks
and holding cells are painted pink to take advantage of this. Perhaps it's a subtle
experiment by investorshub to calm things down a bit.
The important point for latency is how many hops to the processor
holding (or hosting) the cache line and how many hops back.
That's how it works in an EV7 box but the Opteron doesn't use a distributed
directory based coherence scheme like the EV7 does. How will an Opteron
wanting to access a section of memory know which other Opteron(s) have
that memory cached until it has waited for a potential reply from the Opteron
in the system furthest away in terms of round trip delay?
chipguy, I see. Do you have any link to read?
Sorry, nothing specific. Some universities have computer engineering undergrad course material
on-line that might help. Try google on "cache coherency" or "MP system architecture" etc.
4P Opteron has the average diameter 1.17 hops
8P Opteron has the average diameter 1.64 hops
So the memory access latency is 121 ns for 4P system and 137 ns for 8P Opteron system.
You still don't get it do you? The average number of hops is meaningless, it is the maximum
number of hops in any system that is important. The Opteron relies on a broadcast coherency
system. That means any processor that wants to read memory has to broadcast its intent to
every other processor in the system and then wait to see if one of them has the target line
cached. Hop, hop, hop, hop.
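The difference between average and maximum hop count can be made concrete with a small sketch. The topology below is a hypothetical four-node square chosen for illustration only. It is not claimed to be the actual Opteron link layout, nor to reproduce the 1.17 and 1.64 figures quoted above. `hop_counts` and `average_and_max_hops` are names invented for this example.

```python
from collections import deque


def hop_counts(links, source):
    """Breadth-first hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for peer in links[node]:
            if peer not in dist:
                dist[peer] = dist[node] + 1
                queue.append(peer)
    return dist


def average_and_max_hops(links):
    """Return (average, maximum) hops over all ordered pairs of distinct nodes."""
    all_hops = []
    for src in links:
        dist = hop_counts(links, src)
        all_hops.extend(h for node, h in dist.items() if node != src)
    return sum(all_hops) / len(all_hops), max(all_hops)


# Hypothetical 4-node square: each node links to two neighbours.
SQUARE_4P = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

average_hops, max_hops = average_and_max_hops(SQUARE_4P)
# Under a broadcast scheme the requester waits for the farthest reply,
# so the round trip that matters is twice the maximum, not the average.
worst_round_trip_hops = 2 * max_hops
```

For this square the average is about 1.33 hops but the farthest node is 2 hops away, so a requester that must hear from everyone waits on a 4-hop round trip regardless of the average.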
P.S. My favorite stocks right now for hitting a potential “home run” are TMTA and NUVO
I don't know anything about NUVO but if I was in TMTA I would be looking for an exit
strategy. Their JIT binary recompilation on VLIW technology was novel a few years
ago and the low power angle gave Transmeta a foothold. But the performance was
pitiful even compared to the crippled desktop processors it competed against.
Where does it go from here? From four way to eight way VLIW? That will raise power
significantly more than it will raise performance because it will be far beyond the
point of diminishing returns for ILP yet much of the active portion of the processor
will double in width (ifetch, dispatch, execution resources, register file ports etc). A
wider VLIW also means more NOPs, lower code density, and the need for larger
on chip caches. IMO the 8000 might be able to approach Pentium M in either
performance or power characteristics but not both. And that's before Pentium M
even reaches 90 nm.
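The "more NOPs, lower code density" argument can be sketched with a toy scheduler. Everything here is an assumption made for illustration: unit latency, a greedy in-order packer, and an invented eight-operation block whose dependence graph is not taken from any real program or from Transmeta's hardware.

```python
def schedule(deps, width):
    """Greedily pack operations into fixed-width bundles, unit latency.

    deps maps each operation index to the earlier operations it depends on.
    Returns (number of bundles, number of empty NOP slots).
    """
    cycle_of = {}  # operation -> bundle it was placed in
    used = {}      # bundle -> slots already filled
    for op in sorted(deps):
        # Earliest bundle is one past the latest producer.
        cycle = 1 + max((cycle_of[p] for p in deps[op]), default=-1)
        while used.get(cycle, 0) >= width:
            cycle += 1
        cycle_of[op] = cycle
        used[cycle] = used.get(cycle, 0) + 1
    bundles = max(cycle_of.values()) + 1
    nops = bundles * width - len(deps)
    return bundles, nops


# Hypothetical block of 8 operations with a 4-deep dependence chain.
BLOCK = {0: [], 1: [], 2: [], 3: [0, 1], 4: [2], 5: [3, 4], 6: [5], 7: [5]}

four_wide = schedule(BLOCK, 4)
eight_wide = schedule(BLOCK, 8)
```

For this block both widths need four bundles, because the dependence chain sets the length. The 4-wide encoding carries 8 NOP slots and the 8-wide one carries 24, so doubling the width buys no speedup here and triples the padding.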
It looks like TMTA will need to compete on price. Foundry based manufacturing, a
relatively large die, and low ASPs hardly sounds like the recipe for a "home run".
You should look into learning how to read. It will open new worlds for you.
You should try your own advice. The document you linked to says 1.3 GHz Madison
results were "Early lab 4P results". You sure didn't think the 800 MHz engineering
samples of Hammer reflected the speed at which Opteron would be introduced at
so why do you think this Sept 2002 document reflects the maximum clock rate of
Madison?
But by all means continue to believe that it does. We can remind you of your claim
when Madison is introduced and at every subsequent speed grade bump after that.
And Itanium is only supposed to make it to 1.3ghz on .13
Says who? The inquirer?
If you believe it then you will be in for some nasty surprises over the next
year or two.
A 64 bit adder will be slightly slower than a 32 bit adder. But the Pentium
4 readily demonstrates that the time to do a single addition doesn't come
anywhere close to being a limiting factor on processor clock rate.
Chipguy: You obviously don't have the foggiest notion.
I am not the one who claimed a 64 bit adder is 20 to 30% slower than a
32 bit adder. Or that the critical path in modern uPs, and therefore their
maximum clock rate, is dominated by addition speed. Toss me another
nugget from your vast store of knowledge. It's been raining on and off all
day here and I could use another good belly laugh.
In a 64-bit design, there will be several places on the data path that need a 64-bit adder instead
of a 32-bit adder. Consequently, for the same micro-architecture, a 64-bit design will be at least 20-30%
slower in terms of MHz
You obviously don't have the foggiest notion how fast modern integer ALUs/adders are or how
adder speed varies with word length. The maximum propagation delay of a parallel adder is a
logarithmic function of word size. Most of the work is done in the first few levels of the tree, say
4 to 8 bit sub groups. Depending on the exact implementation details the max prop time of a
64 bit adder might be 10% or so longer than a 32 bit adder in the same technology.
Second of all adder speed is only a part of the logic propagation delay that can define maximum
clock speed. The other components include operand bypass mux delay, register output delay,
and result register setup time. In fact, addition can be done so quickly that the fastest clocked
processor on the planet, the Pentium 4, can effectively perform TWO back to back 32 bit additions
in one clock period (a little over 300 picoseconds). I think it is pretty safe to say that it would be
falling off a log easy to perform *one* 64 bit addition in a clock period and have margin left
over to increase clock rate further (although no doubt other critical paths would appear).
Just as an aside, when the 64 bit Alpha 21064 appeared it clocked up to 3 times faster than its
32 bit contemporaries made with similar feature size processes.
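The logarithmic growth of adder delay with word size can be shown with a short calculation. This is a simplified model, assuming an idealised radix-2 parallel-prefix carry tree. Real adders differ in radix, fan-out and wiring, and `prefix_levels` is a name invented for this sketch.

```python
def prefix_levels(bits):
    """Carry-merge levels in an idealised radix-2 parallel-prefix adder.

    Equals ceil(log2(bits)) for bits >= 1, computed with integer arithmetic.
    """
    return (bits - 1).bit_length()


levels_32 = prefix_levels(32)
levels_64 = prefix_levels(64)
extra_levels = levels_64 - levels_32
```

Doubling the word from 32 to 64 bits adds a single merge level to the tree in this model. How much delay that one level costs depends on the implementation, which is consistent with a modest penalty rather than a 20 to 30% one.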
So, 4-way Opteron works practically as a single chip with 4 cores on die.
Practically? I guess if you don't mind adding four chip crossings/retiming/rebuffering
worth of delays to implement coherency to memory accesses, then I guess it works
*practically* as a single 4 way CMP chip. A really badly designed one, that is.
Let me prove to you that Alpha does not scale so well by itself. See, the
score of AlphaServer ES40, 4-way SMP, EV6.7 (21264A) 667 MHz, 8 MB
L2 cache is 400, but the score of AlphaServer ES40, 4-way SMP, EV6.8
833 MHz, 8 MB L2 cache is 350.
What does the scalability of EV6x systems have to do with EV7? Pure
obfuscation on your part.
Opteron scales very well from 1 to 2, and very well from 2 to 4.
Then from 4 to 8 will bear some penalty, because redundant HTT links
in 4P will not be redundant anymore.
You are ignoring the problem of Opteron's cheap and dirty broadcast
coherency scheme. You better go check the size of the memory read
latency penalty when you go from 1 to 2 Opterons before making claims
like this. The distributed directory scheme of EV7 is more complex and
silicon intensive but has much better scalability. But that's ok, the EV7
was intended for big iron applications, the Opteron wasn't.
As Opteron is borrowing a lot from EV7 and is made by the same people, you can expect 32-way
Opteron to deliver very similar numbers.
This claim is so ignorant of so many public domain facts about the two architectures I could
spend the next hour enumerating them. Suffice to say that Opteron will not scale like EV7.
BTW, the development cycle of both Opteron and EV7 largely overlapped. How could the
same people "make" them both? Was there significant moonlighting by ADTers going
on? No wonder both chips slipped schedule so much. Stop trying to rub Alpha technical
excellence off on AMD. You do know where the former EV7 and EV8 teams are now and
where the EV79 team will be shortly?
The difference between you and me is that I already know that 4P Opteron 1.6 Ghz is
faster than 4P Itanium II 1 Ghz, even though single Opteron is slower than single Itanium II,
but you will learn that four weeks from now.
The Opteron will be faster on some work loads, the McKinley will be faster on
many others. Hardly brilliant when you consider the McKinley is a 0.18 um bulk
CMOS chip with aluminum interconnect and the Opteron is a 0.13 um SOI CMOS
chip with copper interconnect. The 0.13 um Madison will ship a couple months
after Opteron and that will be an entirely different ball game.
LOL, a barrel full of minnows.
IMO a second tier OEM is SGI, Unisys, Bull, etc.
I guess it's like being out on a first date. AMD will have to wait and find out.
Opteron is looking really, really, good.
The only independent entities that are qualified to judge that right now are OEMs
that have evaluated the chip, which according to Ruiz includes just about everyone
in the computer industry.
Unfortunately your claim about Opteron looking really, really good seems at odds
with the fact that AMD hasn't snagged any first or second tier OEM as customers
for their chip. What's up with that?
... as Intel can't beat AMD on technical benchmarks...
XP 3000+
SPECint base/peak 960 / 995
SPECfp base/peak 776 / 869
P4/3.067
SPECint base/peak 1099 / 1107
SPECfp base/peak 1077 / 1091
What's up with that?
P.S. Btw, I think Motorola will make a ton of money supplying PowerPC's as embedded processors
for Newisys boxes
A laser printer, car transmission, or video game console is a major win. The market for
PowerPC in embedded control is about 10m devices a year. The Newisys socket is
noise and won't affect Motorola's bottom line in any measurable way.
By the way, IBM always made very good profits selling processors to Apple.
I very much doubt this for two reasons:
1) Apple has for years carefully played IBM and Motorola off against each other with
its dual sourcing policy to ensure rock bottom prices. Embedded control has
in recent years offered higher ASPs for Mot, which is reflected by its future
development efforts (i.e. what the G5 will be versus what Mac fanatics dream of).
2) IBM Microelectronics lost close to a billion dollars last year despite having the
highest foundry service pricing structure in the business. You obviously must be
talking about "profits" in a theoretical sense *grin*.
The Northwood is about 10% larger than the 970 but will have a much smaller variable
manufacturing cost than the 970 due to the latter being made in a significantly more
expensive process (SOI CMOS with one more interconnect layer) and manufactured in
much smaller volumes (Apple is roughly 3% of the PC market and dropping). Add the
fact that development NRE of the 970 will be spread over a much smaller lifetime sales
volume and it is obvious that the total cost per device will be much higher for the 970
than the P4. I will grant you that when it comes to *pricing* the numbers may be closer than
the total costs suggest as Intel consistently makes money selling uPs while IBM Micro
consistently loses money.
And you can bet it will sell for far less than $400.
I am not quite so confident. At best it will be manufactured and sold in volumes close to two orders
of magnitude less than a mainstream Intel x86 processor and it is built in a more expensive (SOI)
process.
That being said, the 970 should still be less expensive than any comparable 130 nm IPF processor.
I wonder how much of the server market Opteron may take.
Very little unless it gets a lot more backing from major OEMs than is
currently evident.
Highly unlikely when Itanium chips are > $2K a pop. Also the clock speed is
still lower than P4 and Athlon. Where's the benefit?
They would no longer be lying about the supercomputer thing.
Nevertheless the difference between the probability this happens
and zero is negligible.