Well, that's what the code currently does. I'm trying to find something faster. If nothing else, I could just unroll the loop into chunks of 8 as well.
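For reference, unrolling into chunks of 8 might look roughly like the sketch below. The actual loop body isn't shown in this thread, so the per-byte operation (OR-ing every byte together) and the name `or_bytes_unrolled8` are stand-ins I made up for illustration, not the real code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the real loop: OR all bytes of a buffer
 * together. The per-byte operation is an assumption; only the
 * unrolling pattern is the point here. */
static uint8_t or_bytes_unrolled8(const uint8_t *p, size_t n)
{
    uint8_t acc = 0;
    size_t i = 0;

    /* Main body: handle 8 bytes per iteration. */
    for (; n - i >= 8; i += 8) {
        acc |= (uint8_t)(p[i]     | p[i + 1] | p[i + 2] | p[i + 3] |
                         p[i + 4] | p[i + 5] | p[i + 6] | p[i + 7]);
    }

    /* Tail: the remaining 0..7 bytes, one at a time. */
    for (; i < n; i++) {
        acc |= p[i];
    }
    return acc;
}
```

This only cuts loop overhead; it still does one byte-sized load per byte, so how much it helps will depend on the compiler and target.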
I'm going to try the aligned longword approach, since my earlier testing suggests it is far more efficient.
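A minimal sketch of the aligned longword idea, assuming a "longword" means 32 bits and again using a made-up per-byte operation (OR-ing all bytes together) in place of the real loop body, which isn't shown here. The function name `or_bytes_aligned32` is hypothetical too.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the real loop: OR all bytes of a buffer
 * together, reading 32 bits at a time once the pointer is aligned. */
static uint8_t or_bytes_aligned32(const uint8_t *p, size_t n)
{
    uint8_t acc = 0;
    uint32_t wacc = 0;

    /* Head: single bytes until p sits on a 4-byte boundary. */
    while (n > 0 && ((uintptr_t)p & 3u) != 0) {
        acc |= *p++;
        n--;
    }

    /* Body: one aligned 32-bit load per iteration. memcpy keeps the
     * access well-defined in C; compilers typically turn it into a
     * plain load. */
    while (n >= 4) {
        uint32_t w;
        memcpy(&w, p, sizeof w);
        wacc |= w;
        p += 4;
        n -= 4;
    }

    /* Fold the four byte lanes of the word accumulator into one byte.
     * This works regardless of endianness. */
    wacc |= wacc >> 16;
    wacc |= wacc >> 8;
    acc |= (uint8_t)wacc;

    /* Tail: the remaining 0..3 bytes. */
    while (n > 0) {
        acc |= *p++;
        n--;
    }
    return acc;
}
```

Whether this trick carries over depends on the real operation: it needs to be something that can be applied to four bytes at once and combined afterwards, which is true for OR/XOR/compare-style loops but not for everything.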