Mas, these are very interesting details. I'm particularly interested in the fact that Haswell has in fact FIVE unique die implementations, built from scalable IP. Haswell is a big jump for them into more of an SOC methodology.
In previous generations, their only modularity was a chop between quad core and dual core. They introduced a second graphics configuration in Sandy Bridge, but the Haswell modularity seems more building-block based. See how the dual-core die photo (with GT3) has the graphics as two mirror-image copies of the block from the quad-core photo (with GT2)?
Figure 5.9.1 shows the different modularity blocks. They have modularized the cores, the caches, the graphics, the DDR3 and LP-DDR3 channels, and various parts of the SA (System Agent...?). It looks like the main differences between the SA blocks are the inclusion or exclusion of the DMI and OPI interfaces and the PCIe block.
I was also impressed by how low they got the power on the OPI link: 8mW/GB/s for the parts with the on-package PCH, and about 10mW/GB/s for the one with eDRAM, all while providing a 25x dynamic range and scaling past 100GB/s for the big graphics die.
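To put those link numbers in more familiar units, here is a quick back-of-the-envelope sketch. It is just arithmetic on the figures quoted above, not anything from the paper itself. The function names are my own, and I'm assuming decimal gigabytes (1 GB = 1e9 bytes) and that the roughly 10mW/GB/s figure holds all the way up at 100GB/s.

```python
def mw_per_gb_s_to_pj_per_bit(mw_per_gb_s: float) -> float:
    """Convert link efficiency from mW per GB/s to pJ per bit.

    1 mW/(GB/s) = 1e-3 J/s / 1e9 B/s = 1e-12 J/byte = 1 pJ/byte,
    so the number in mW/(GB/s) is already pJ/byte; divide by 8 for bits.
    Assumes decimal gigabytes (1 GB = 1e9 bytes).
    """
    pj_per_byte = mw_per_gb_s
    return pj_per_byte / 8.0


def link_power_watts(mw_per_gb_s: float, bandwidth_gb_s: float) -> float:
    """Total link power in watts at a given bandwidth, assuming the
    efficiency figure stays constant at that bandwidth (an assumption)."""
    return mw_per_gb_s * bandwidth_gb_s / 1000.0


if __name__ == "__main__":
    print(mw_per_gb_s_to_pj_per_bit(8))    # parts with the PCH on package
    print(mw_per_gb_s_to_pj_per_bit(10))   # eDRAM part (the "about 10" figure)
    print(link_power_watts(10, 100))       # eDRAM link at 100 GB/s
```

So 8mW/GB/s works out to 1 pJ/bit, and at about 10mW/GB/s a 100GB/s link would burn on the order of 1 W, if the efficiency really does hold at full bandwidth.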
It really is true that most of the "tock"-level changes in Haswell were "under the hood", so to speak. They didn't add much user-visible performance, but the design was a crowning achievement nevertheless.