yourbankruptcy, I am challenging your assumptions about the efficiency of a dual A64 system. I believe you are overlooking a basic design consideration: with only one aHT channel, that channel has to be divided into two 8-bit channels on one of the processors so that it can support both access to I/O and access to remote DRAM. (That assumes the channel can be reconfigured that way, which I am not sure of. Otherwise, you would have to implement an external controller to time-slice the one 16-bit channel between the two destinations.)
Either way, both processors are burdened, because they are now connected by only half an aHT channel (either halved to 8 bits, or sharing its bandwidth with I/O).
Now, what happens when each processor has to access remote DRAM? Your remote DRAM access is now in competition with your I/O. Not a pretty picture! It gets worse when one processor has to copy from remote DRAM to, say, a DVD-RW attached to the remote processor.
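To put rough numbers on the two cases, here is a back-of-the-envelope sketch. The figures are my assumptions, not measurements: a 16-bit link running at 1600 MT/s per direction, and a made-up 0.5 GB/s of I/O traffic. The function names are just for illustration.

```python
def link_bandwidth_gbs(width_bits: int, transfers_per_sec: float) -> float:
    """Peak one-direction bandwidth in GB/s for a link of the given width."""
    return width_bits / 8 * transfers_per_sec / 1e9


def remote_dram_share_gbs(link_gbs: float, io_gbs: float) -> float:
    """Bandwidth left for remote DRAM when I/O shares the same link."""
    return max(link_gbs - io_gbs, 0.0)


# Assumed link parameters (not taken from the thread).
TRANSFERS_PER_SEC = 1.6e9   # 1600 MT/s per direction
ASSUMED_IO_GBS = 0.5        # hypothetical I/O load on the shared link

full_link = link_bandwidth_gbs(16, TRANSFERS_PER_SEC)  # undivided 16-bit link
half_link = link_bandwidth_gbs(8, TRANSFERS_PER_SEC)   # one 8-bit half
shared = remote_dram_share_gbs(full_link, ASSUMED_IO_GBS)

print(f"full 16-bit link:        {full_link:.1f} GB/s")
print(f"8-bit half for DRAM:     {half_link:.1f} GB/s")
print(f"16-bit shared with I/O:  {shared:.1f} GB/s left for remote DRAM")
```

Under those assumptions the split case caps remote DRAM at half the link's peak no matter how idle the I/O is, while the shared case loses whatever the I/O happens to be using at that moment.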
In terms of existing designs, it is essentially equivalent to taking a P4 or Athlon, disabling the northbridge, and bringing in memory through the southbridge instead (in competition with PCI).
It would be a crippled system. You won't achieve what you want.