surprised to see the latency did not improve much with the faster FSB...
That does not surprise me at all. Bandwidth and latency are independent variables.
Latency is dominated by the need to go through the memory controller, plus the synchronisation and RC delays of going through an external chip. At this time AMD is king on this issue thanks to its on-chip memory controller. What has amazed me is the extent to which the Intel Core architecture has made up for that intrinsic disadvantage. Having larger caches obviously helps, since the needed data will already be in the cache more often than on the Opteron. In addition, the Core does an excellent job of prefetching data for loops. Even though there will be unavoidable cache misses, where the long memory latency really hurts, the Conroe makes up for that and whips the Opteron.
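To put rough numbers on how a better hit rate can offset slower memory, here is a minimal sketch using the textbook average-access-time formula. All of the figures (hit rates, cache and memory latencies) are made up for illustration and are not measurements of Conroe or Opteron; the function name is mine as well.

```python
def average_access_ns(hit_rate, cache_ns, memory_ns):
    """Average access time: every access pays cache_ns, misses also pay memory_ns."""
    return cache_ns + (1.0 - hit_rate) * memory_ns

# Hypothetical chip with an on-die memory controller: fast memory, smaller cache.
on_die = average_access_ns(hit_rate=0.95, cache_ns=5.0, memory_ns=50.0)

# Hypothetical chip with an external controller: slower memory, but a larger
# cache and better prefetching push the hit rate up.
external = average_access_ns(hit_rate=0.98, cache_ns=5.0, memory_ns=80.0)

print(f"on-die controller:   {on_die:.1f} ns average")
print(f"external controller: {external:.1f} ns average")
```

With these assumed inputs the chip with the slower memory path still comes out ahead on average, simply because it goes to memory less often. Change the hit rates and the conclusion flips, which is why the result depends so heavily on the workload.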
AMD also has an intrinsic advantage in multi-socket systems, since each socket has an independent memory link. You would expect the Core to be severely handicapped in multi-socket situations. Amazingly, the V8 and quad core outperform the Opteron in most situations, even though the multiple cores have only one shared channel to memory. I expect a faster FSB to help the Core in the bandwidth-intensive benchmarks where the multi-socket Core currently gets beaten.
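The shared-channel point can be sketched with back-of-the-envelope arithmetic. This assumes a 64-bit (8-byte) front-side bus whose peak bandwidth is split evenly between sockets, which is a simplification; the helper names and the two-socket setup are just for illustration, and real sustained bandwidth will be lower than these peaks.

```python
def bus_bandwidth_gb_s(mega_transfers_per_s, bus_width_bytes=8):
    """Peak bus bandwidth in GB/s for a given transfer rate (MT/s) and width."""
    return mega_transfers_per_s * 1e6 * bus_width_bytes / 1e9

def per_socket_share(total_gb_s, sockets):
    """Even split of one shared channel between sockets (a simplifying assumption)."""
    return total_gb_s / sockets

for fsb in (1066, 1333):
    total = bus_bandwidth_gb_s(fsb)
    share = per_socket_share(total, sockets=2)
    print(f"FSB {fsb}: {total:.2f} GB/s peak, {share:.2f} GB/s per socket when shared by 2")
```

The per-socket share rises in direct proportion to the FSB speed, while a design with an independent memory link per socket keeps its full link bandwidth no matter how many sockets are added. That is the gap a faster FSB can narrow but not close.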
CSI will obviously nullify the intrinsic advantage AMD has with its on-chip memory controller. It remains to be seen what performance levels will be attained.
The Penryn has "low architecture risk" since it uses the proven Core architecture. The reliability risk from the new materials in the high-k (HiK) gates is the only thing that could hold up the 45nm ramp.
Any decent CSI implementation on the Nehalem should bring significant improvements to multi-socket systems. I hope that there will be demos or disclosures of CSI this fall.