To dougSF30
That doesn't sound like the talk I am thinking of. There was also video.
"critical timings are in the memory interface"
For integer work I agree with you, but having done some fp video work I can say that the limiting factor there is dense fp operations that fill the execution pipelines.
With 16 XMM registers rather than 8 on x86, the situation gets worse: register pressure is largely eliminated, so now you are going to have fpadds and fpmuls backing up the pipeline. 3DNow was balanced against the number of execution units (3) but SSE is twice as wide, so widening the # of units makes sense to me.
You've doubled the register file and roughly doubled the fp width; now it's time to widen the execution width.
Increasing the speed from cache does not address the pressure point; remember, the fp units have their own instruction queue.
McGrath's Stanford talk was very clear that an enhanced fp unit was for the k8. He did not say there would be an extra execution unit - that is my speculation - but it makes sense.
As you say, the fp units are specialized. While having 4 identical, all-powerful fp execution units may be ideal, it may not be practical. Better engineering may be 2xfpadds and 2xfpmuls.
SSE2 has a lot of integer too
Why Pentium overtook Athlon
I was reading the notes on the VirtualDub site www.virtualdub.org and I see that the author's experience is similar to mine. This leads me to conclude that a major reason for intel's success with its pentium4 was Via's buggy Athlon chipset.
I had a Gigabyte motherboard with an XP1800 and the 686B southbridge, used mainly for video work. While it was running win98 it was perfectly stable. Then I upgraded it to Win2000. In a couple of months I had disk corruption on the system drive about 5 times. Four times I had to reload the whole OS and application programs. I researched the topic and found attempts by a George Breese to software-fix the chipset. In the end his project was abandoned. I tried every combo of drivers, Via and Microsoft, all to no avail. The failure only happened with large files.
Eventually I dumped the motherboard (in the trash!) and bought an Nvidia-based board. All works fine now.
Via knew about this problem but tried to hide it. It cost AMD a lot of business I'm sure and damaged their reputation. It was just about this time that intel ran away with the market.
If I had persevered less I might have blamed the Athlon. It's a sorry tale of Via; thank god for Nvidia.
SSE3 and execution units
McGrath already said, two years ago, that a new fp unit was far advanced. That's why I think it's all part and parcel of the E0 stepping.
I don't see that it would require massive changes. It's just a matter of having the same interface as the existing section, something that was part of the design parameters.
Better SSE+SSE2 is a good thing to compete against PowerPC and provide a boost to workstation & big technical computer designs, i.e. the very areas where Sun is strong. Demand by Sun will drive this transition IMHO. The server biz is already doing very well - the area where k8 already shines and fp is unimportant.
Better fp may well be the low hanging fruit right now.
The decision to drop MMX, 3DNow & x87 in 64-bit XP64 programs makes a better SSE/SSE2 more relevant than ever.
Cheers
SSE3 & "improvements"
In the Stanford lecture, Kevin McGrath alluded to a new floating point unit for the K8.
Looks like this will come with the E0 stepping early next year.
http://www.xbitlabs.com/news/cpu/display/20041104142743.html
The existing 3 execution unit design is excellent for MMX & 3DNow but is not wide enough for SSE & SSE2, which requires a minimum of 4 units.
IMHO the addition of SSE3 is of minor interest compared to any speed up in SSE/SSE2 which is heavily used, and execution unit limited, in video data processing.
Loop alignment
In the assembler code of the rc4, the main loop has the label .Lstart
The top of this loop must be aligned on a natural boundary, preferably on a cache line boundary - 32 bytes. There should be an ALIGN directive. Note that the whole function is aligned at 16 bytes, which is pointless as far as I can see (compared to the main loop that is executed hundreds of times).
If the top of the loop is not aligned this will interfere with decoding and may reduce cache efficiency. The final stage of any really time-critical code is seeing where the code boundaries are.
Unless the code is specified to be on a certain boundary you have no idea where it will locate, so the first run may be really, really fast and then the second run slows right down. This is so common when code is being tested; the addition of a single, seemingly inconsequential, instruction can make or break the runtime. Often it's the debug instructions themselves, and when they are taken out the code is nowhere near as fast!
AMD64 Optimization
Well I find that style of assembly difficult to read. I'm used to good old masm.
Like any assembler program there are areas where I think it can be improved. It's a really simple algorithm after all, at least the bit done in the assembler file. Looks to me that the speedup comes from reducing the number of memory accesses by keeping variables in registers and processing 8 bytes at a time. There are just 18 instructions in the main loop.
There is no use of SSE or MMX regs in this, just the extended GP register set.
If the writer has benchmarked his code as he went along there's probably little micro-optimization to be done. There's no obvious attempt to align the loop on a code boundary - this is a mistake.
Serial Flash Interface
Can someone put some bones on this please?
Sounds like a Rambus type design?
Parallel execution
"The second example has to execute sequentially as they all modify EAX."
Accessing memory is much slower than a register dependency, by factors of 10. By increasing code size you are creating memory accesses. I think you'll find that the load unit can queue these accesses anyway.
Download CodeAnalyst and look at the (simulated) code execution.
If you want a real speedup think about the ja opcode - tests the CF and ZF in one instruction. One of my favorite tricks!
The cache line is 32 bytes wide. Assuming esi is aligned and the line is not in the cache, the first read will cause a cache line replacement and the second two reads will be from L1.
The tech docs have information on the load unit but I'm sure it can handle more than 3 queued loads from the address generation unit.
Used to be that the esi load could not be fast decoded and you should force the opcode to be esi+0. Not sure if that is still the case.
Given the shorter code of your second choice that must be faster.
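To make the EAX point concrete, here's a C sketch of the same reduction written two ways (sum_serial and sum_split are my own names, not from the posted code): one serial dependency chain where every add waits on the last, versus four independent chains the execution units can work on in parallel.

```c
#include <stdint.h>
#include <stddef.h>

/* One serial dependency chain: every add depends on the previous
 * result, so only one add can retire per chain step. */
static uint64_t sum_serial(const uint32_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent chains: the adds no longer all funnel through one
 * register, so multiple integer units can be busy at once. */
static uint64_t sum_split(const uint32_t *a, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)        /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Both return the same sum; only the dependency structure differs.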
Market Cap of AMD is SOoooo small
Always remember that there are really very few shares out there. Current market cap is < $6bn
I recall when it went through $15bn.
Lesson is, don't look at the price/share, always look at the cap - a much more meaningful number
Re: MSDN Presentations
Whew! Looks like Microsoft marketing is gearing up for XP64, doesn't it? Typical first part of their marketing plan. Next you should see some articles in the magazines.
Perfect way to start it all rolling.
MSDN Webcast: AMD64 Architecture Drilldown: 64-bit Performance and 32-bit Compatibility—Level 400
November 4, 2004 11:30am-12:30pm
http://msevents.microsoft.com/cui/WebCastEventDetails.aspx?EventID=1032259938&Culture=en-US
Make a note in your diaries.
FP, SSE & SSE2 unit
Collectively referred to here as "FP".
Incidentally this is why I am so interested in the E0 stepping. Obviously when McGrath gave his talk at Stanford and alluded to the enhanced FP unit the work must have been well underway.
While 3 FP execution units (as in Athlon/Athlon64 & Opteron) would seem to be a good tradeoff for x87, MMX & 3DNow processing it makes no sense when you go to SSE and SSE2. You need 4 units so that 4x32-bit FP data can be processed in parallel. Reducing the latency of FP & SIMD operations makes little difference if the pipelined execution unit is stalled.
We know from other processors that considerable speedups can be achieved in FP code (IBM's Power series for example). Funny thing is that the target markets for these higher-end processors rarely need really great FP performance. Where it's needed is in video processing - i.e. the desktop.
To me the addition of SSE3 is not that important. A better SSE unit is what I need.
Remember I used the beta. Sad to say some of the optimizations that were apparent actually made the real code run slightly slower - it was a beta, remember - so I fell back on my own experience. I'm sure the newer versions are better. It is a free download from AMD. They even sent me a pocket knife and a flashlight for being a beta tester. The pocket knife is particularly useful for prying intel motherboards out of rack units, seems to have been designed for the purpose. . . .
When I used CodeAnalyst it was test and repeat. I don't know about newer versions. I'd like to hear your experience.
The trade-off in this stuff is accuracy vs. speed. I needed to do a completely compliant conversion; FP is the only way to go. You are right for jpeg display - who cares as long as there are no visible artifacts - but when you are doing repeated conversion cycles, well then accuracy is all.
Integer will always be preferable to fp in x86 machines; the cost of an integer -> fp conversion is high. Scaled integer and FP need not be exclusive. I left out of my explanation of MMX vs FP that a lot of the special-case optimization code, like all-black blocks, is always integer.
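One way the two coexist is scaled (fixed-point) integer. A hypothetical sketch - mul_q15 is my own name - of multiplying by a fractional coefficient held in Q15 format without ever leaving the integer units (MMX's PMULHW delivers the high half of such products 4 lanes at a time):

```c
#include <stdint.h>

/* Multiply a sample by a fractional coefficient stored in Q15
 * fixed point (coeff_q15 = coefficient * 32768). The product is
 * computed in 32 bits, then shifted back down to Q0. No int->fp
 * conversion is ever needed. */
static int16_t mul_q15(int16_t x, int16_t coeff_q15) {
    return (int16_t)(((int32_t)x * coeff_q15) >> 15);
}
```

With coeff_q15 = 16384 (0.5 in Q15), mul_q15(200, 16384) gives 100.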
MMX & FP in Video code.
There's a lot of data shuffling that goes on, and clamping the outputs as well as initial operations that can be done in integer.
Basically the loop sequence is:
1) Organize the integer*16 data (MMX)
2) Do operations in integer that lose no accuracy (MMX), e.g. 16x16 -> 32 or simple adds
3) Convert to FP
4) Do the multiplicative parts (lots and lots of FP calculation)
5) Convert back to integer
6) Clamp the output (MMX)
7) Organize the output (MMX)
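A rough scalar C sketch of that loop sequence, illustrative only - process_block and scale are hypothetical names, and the real code did the lanes in parallel with MMX registers and packed FP:

```c
#include <stdint.h>

/* Scalar model of the mixed integer/FP pipeline: lossless integer
 * prep, FP for the multiplicative part, then convert back and clamp
 * to 0..255 the way MMX pack instructions would. scale[] stands in
 * for the per-coefficient multipliers. */
static void process_block(const int16_t in[64], const float scale[64],
                          uint8_t out[64]) {
    for (int i = 0; i < 64; i++) {
        int32_t t = (int32_t)in[i] * 2;   /* step 2: lossless integer op */
        float f = (float)t * scale[i];    /* steps 3-4: FP multiply      */
        int32_t r = (int32_t)(f + 0.5f);  /* step 5: back to integer     */
        if (r < 0) r = 0;                 /* step 6: clamp the output    */
        if (r > 255) r = 255;
        out[i] = (uint8_t)r;              /* step 7: store               */
    }
}
```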
I will do a version that replaces the MMX with SSE2 when I get time. Haven't looked at the code in some years.
DCT observations
I recall that when I was developing the DCT code I ran it through AMD's CodeAnalyst (then in beta form). My DCT did a lot of its work in floating point on normal 8x8 macroblocks. MMX was used a lot too - I did all I could to make it fast but above all accurate.
The CodeAnalyst pipelining graph was amazing. The CPU was completely stalled by full pipelines. No memory or decoder bottleneck, just instruction after instruction waiting for pipeline stages to retire. I managed to keep everything in registers and did some tricky bits but in the end there just wasn't much more that could be done. There was a lot of MMX too, with PAVG used heavily for motion comp.
It was nearly an optimal DCT, output-quality-wise.
I don't know about XviD but the DivX encoder on an XP 1800+, without any external hardware, was faster than realtime too. I think the results of software are about the same. In that scenario the cost of the video disappears.
Still I stand by my belief that video processing is one, if not THE, app that will drive home users to needing faster hardware. Funny thing is that it's a different demographic interested in processing video - an older group, some retired, who have the money to spend.
Cheers
The sage of video
is www.videohelp.com
Everything takes a long time - ok if it all works but when it goes wrong its half a day wasted.
Big fast disks are the norm. A 120Gig drive is $59 at Frys yet the biggest you can buy is 300Gig. Strange, there is no high end really. Processing time is really a function of CPU. Example: if you have to shrink video to fit on a 4.7G DVD it will take 2-3 hours (decoding & re-encoding). If the video fits already the process takes 5 minutes. Just about the same amount of disk transfer.
19.2 Mb/sec for HDTV
Yep, I was right, my memory was accurate. Max bandwidth per channel is what I said:
http://en.wikipedia.org/wiki/HDTV
RAID Systems for Video
Funny you should mention that. I have a 3Ware raid controller, just for the purpose.
I think HDTV is 19.2 Mb/sec - it's a number that's in my head. Easy to check. I think you'll find that the compression ratio is 50:1, not 4:1. It's pretty much state-of-the-art. It still amazes me that such a great picture can be delivered through my little ol' antenna outside.
Grabbing HDTV
Came across this guide that talks a little about transferring HDTV to disk and processing.
http://www.vidphiles.com/
Seems like grabbing a DTV signal isn't much of a problem.
Grabbing video
Yes Pete, I agree that a grabber may as well do some video processing. I can, but do not, grab video in MPEG2. The reason is that I want lossless compression until I have finished editing. So I grab an AVI, sometimes with Huffyuv. Finally, at the rendering stage it's converted to MPEG2, sometimes MPEG4.
I actually wrote code (assembler) to do fast DCTs. Some of it was in popular freeware programs and one section was in the DivX codec. Don't know if that's still the case. Probably not, as it was x86 assembler and it makes sense for DivX to be hi-level.
My point is that the *processing* (as opposed to the grabbing) of grabbed video, something that many ordinary consumers are or will be doing, requires CPU and lots of it. Grabbing can and often does use the GPU but even that is not really necessary with CPUs >2GHz, more a hangover from old, slow processors and slow busses. I can't remember the last time I had a dropped frame during capture.
I will be building a media center one of these days, to grab HDTV and record it. I am really waiting for the HD-DVD spec to be sorted out. I have given up on anything less than HDTV for OTA feeds. I'm not sure if a "dumb" grabbing card can get 1080i. It's 19.2 Mb/sec, a little over 2MB/sec. At 50:1 compression what's the GPU going to do with it, surely it's not going to do any more compression on it? Aren't all grabbers "dumb" when it comes to putting the HDTV feed onto disk?
Cheers
Video processing - Pete
I don't know what editing, rendering and burning video has got to do with the GPU. The impact on the graphics card is very small, just previewing. The work is taking a video data stream from disk, processing it - including compression - and writing it back to disk.
Dedicated hardware has a small role to play IMHO. The cost of pushing vast amounts of data to a graphics peripheral - across the bus - outweighs any advantage of dedicated hardware. The bus is already pretty busy getting data from and sending data to the hard drives. What would the graphics hardware do? MPEG2 encoding? What then when you want to do MPEG4? Or whatever new encoding comes along (the BBC is thinking of creating one I heard)
The lesson of the last 20 years is that processing data streams is better done in software than hardware. Therefore, respectfully, I have to disagree with you.
OT Flash blocking in Firefox/Mozilla
If you're using either of these browsers I suggest you look into the flashblock extension. It blocks automatic playing of flash/shockwave and replaces the video with an elegant "f" that changes on mouseover to a play button so you can watch the video if you want.
Boy does it save bandwidth. flashblock.mozdev.org
Processing power - for video
Have you tried editing, rendering and burning video? The amount of data is huge - just loading one DV tape from my home camcorder is 12+ Gig.
Then it has to be edited and shrunk to 4.7GB
Given the number of camcorders out there I know that lots of home users would like to do this.
The rendering phase for a 1 hour movie can be 3 hours with minimal effects/music. Add some special effects and titles and you are looking at 5 hours.
Then my daughter rips DVDs onto her hard drive using DVDDecrypter (a truly great and free app) and shrinks them with DVDShrink (another great and free app) before archiving them onto DVD+R. It takes 2 hours to process one DVD, and most of that time is in the shrink phase where it is re-encoded, highly CPU intensive.
This is stuff done in any home. Minimum processor is 2000+ and that just squeaks in. I use a 2600+ and will upgrade soon; it's not worth a 300-point change to upgrade but a 2600->3600 is worthwhile.
So the answer to your question - ignoring gamers - is a lot more power yet.
Of course the big, big thing as far as power and i/o is concerned is HDTV. Some early hi-def home camcorders are out there now but they are still expensive ($3000). Once the Blu-ray/HD DVD standard is sorted out there will be an explosion in video demand. I see the need for full speed home multi-processor units with incredibly fast i/o just 12 months away.
I guess this means they won't be doing the 10GHz Pentium 4 either?
Remember how they touted their process technology and those high speed transistors. 10GHz, it was said, is just a matter of time.
Looks to me like they fired all their best engineers. That H-1B program may have been just an eensy-weensy bit of a mistake.
In the meantime what's a video processing user to do?
EOLing K7s
Keith:
AMD may have little choice in the matter. With the release of Windows XP64 the available fab space will have to meet end-user demand and Microsoft's expectations. Looks to me that AMD is working towards a 1st qtr release by Microsoft.
HP Workstation Customers Prefer AMD Chips
Oh COOL HEADLINE FROM FORBES!
http://www.forbes.com/markets/2004/09/27/0927automarketscan08.html?partner=yahoo&referrer=
And Sun is a consumer-product company?
I don't think your argument holds water.
Gateway will disappear unless it drops the Intel-only restriction. Too many buyers at least want the option.
WOW! 64-bit desktop improvements.
Comparing 32-bit and 64-bit Linux for desktop applications with AMD and Pentium cpus.
Take a look at http://www.anandtech.com/linux/showdoc.aspx?i=2213
and look at a) how Athlon64 in 32-bit mode compares to 32-bit Pentia and b), best of all, how 64-bit Athlon64 is MUCH faster than 32-bit. You have to do a mouseover to see the 64-bit numbers but the results are dramatic.
Frys, Best Buy & Staples were very busy with obvious back-to-school activity (So. Cal.).
Advantages of greater address space w/o physical memory.
Three questions you pose. I'll speculate with one answer.
The total address space used will include all the executables including mapped-in dlls, stacks plus data area. So let's say ~500MB typically will be used by code in a typical large application (a lot of the footprint will be system stuff that's mapped in and shared by all apps but each app has to count it in its own used address space). In standard Win32 that would leave 1.5G for data. Win32 LAA would give 3.5G.
In media apps there's going to be a lot of data. As address space is used up there will be some point at which old data is recycled - destructor functions being called. I presume that the application will allocate some large chunk of free space at startup and use that, keeping its own "free list".
I'm interested in whether the destructors, which should be called less frequently with the 3.5G of address space, impose any significant penalty. Some programs may even "spill" to disk files - Adobe Photoshop does this. It may be faster to use the open pagefile rather than generate even more disk-head movement.
The first thing to know about 4GT is whether, in practice, there's any change in performance. That's a rather important question.
The page file will have to be increased too.
Thanks mmoy.
Actually it's not about hardware, it's about address space. It enables the app to use a full 4GB of address space (not physical memory).
A good free test might be POVRay, it includes a benchmark image. Their website is www.povray.org.
It may only come into play on really large data files. I don't know.
Anyone with Windows XP64
Is there anyone out there running WinXp64 who has tested, or is willing to test, applications that need large address spaces (Maya type apps come to mind) with 4GT?
All you have to do is run editbin to set the LAA flag in the .exe header.
I'm really interested in the results.
Details on 4GT are here:
www.amd.com/us-en/assets/content_type/DownloadableAssets/Expand_Memory_of_32-bit_App_-_Microsoft_4GT-_6204.pdf
Looking at dual-core Opterons
at:
http://www.amdzone.com/modules.php?op=modload&name=Sections&file=index&req=viewarticle&a...
it appears that the dual is < 2x size of 90nm Opteron.
Or am I mistaken?
65nm reduced leakage
That PR from Intel you quote is interesting:
"The novelty has the same cell size and over 0.5 billion transistors. Transistors have 35nm gates that´s about 1/3 smaller than that of a 90nm crystal. Besides, according to the press release, leakage currents were reduced by 4 times.
The press release stresses the reduced power consumption made possible by sleep transistors feature that disables unused circuits."
Hey, I'm sure that 99.9% of transistors in a 70Mbit SRAM are unused at any one time. What, the sense amps and an active row+column? There's not even any refresh going on (it's SRAM).
This PR seems to skirt the issue of whether an individual 65nm transistor has less leakage than an intel 90nm one. Looks to me like they have some major issues at 65nm, else the PR would not have that weasel.
Well Sgolds I must say I admire your candor. Not sure putting such personal info out there for all to see is such a great idea. I've been attacked based on my google entries even though they are good. People can always spin information one way or another.
Good luck, seriously, with the restaurant. Remember, if it's losing money after 1 year - BAIL! Most restaurants do fail after all.
The one restaurant that I knew really well made money from day 1 and after a few years the two principals were millionaires. The money was made out of the martinis, not the food. It's in Laguna, CA.
No, Win3.0 was a pretty amazing piece of work, especially for the '286 processors. Basically it did memory mgmt in software. Remember there was something called "hamburger" or something that sat on the heap. Win3.0 patched dll entry points on-the-fly.
Even Win95 (and maybe '98 I never dug into the internals) used a lot of DOS services.
NT was a much better OS but there are still a lot of people running 98. I have one laptop still running it just so I can make DOS boot disks. NT is *real* in the sense that it is somewhat stable.
I think most people would consider 95 and 98 real OSes, even just by the number that were bought, but they are a b*tch to maintain