Re: Unaligned access improvements
There may be other improvements in this area, but the big one I'm aware of is that Nehalem is once again able to do store-forwards to unaligned loads across 1, 2, 3, 4, 5, 6, 7, 8, and 12 byte boundaries (i.e. it can forward store data to an unaligned load without waiting for that store to complete its write to the cache).
This was implemented in P4 thanks to the additional pipestages that were available in that uarch, but Merom took a step back to the P6 days of only being able to forward to loads that are aligned to 8-byte boundaries (they may have added 4 and 12-byte alignment, I'm not sure).
Nehalem once again gets close to an alignment-agnostic state of being, which should close some performance glass jaws that exist out there today for Core 2 chips.
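
For anyone who wants to see the access pattern in question, here is a rough C sketch: an 8-byte store followed right away by a 4-byte load from an address that sits inside the stored bytes but is not aligned with the start of the store. The helper name store_then_load and its parameters are made up for illustration. Whether the hardware actually forwards the data depends on the uarch, and an optimizing compiler may keep the value in a register and never touch memory at all, so treat this as a picture of the pattern and not as a benchmark.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from any library): write 8 bytes at
 * buf + store_off, then immediately read 4 bytes at buf + load_off.
 * When load_off falls inside [store_off, store_off + 4], the load
 * depends entirely on the store that was just issued, which is the
 * case store forwarding is meant to speed up. */
static uint32_t store_then_load(unsigned char *buf, size_t buf_len,
                                size_t store_off, size_t load_off,
                                uint64_t value)
{
    uint32_t out;

    /* Bounds checks so the sketch cannot run off the buffer. */
    assert(store_off + sizeof value <= buf_len);
    assert(load_off + sizeof out <= buf_len);

    /* 8-byte store; memcpy keeps the unaligned access legal C. */
    memcpy(buf + store_off, &value, sizeof value);

    /* 4-byte load that may start at any byte offset. */
    memcpy(&out, buf + load_off, sizeof out);
    return out;
}
```

Calling it with, say, store_off = 3 and load_off = 5 gives a load that starts two bytes into the store and is aligned to neither a 4-byte nor an 8-byte boundary. Going by the description above, that is the kind of case that would stall on Core 2 until the store reaches the cache and that Nehalem should be able to forward.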