It seems to me that you are blaming the wrong party here. Obviously the AnTuTu code was sloppy, and nobody verified that it was valid as a benchmark.
It would have been very simple for the test writers to compile the code with various compilers and run it on the same physical machine, such as a desktop PC, to see whether there were compiler-generated artifacts in the results.
The test developer could also have very easily turned off all optimization on all compilers.
Or they could have just written the code in assembler and been sure that the compiler does not interfere.
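To make the cross-compiler check concrete, here is a rough sketch of how it could be scripted. This is only an illustration under my own assumptions: the compiler list, the flags, the file name bench.c, and the helper names build_commands, spread and run_matrix are all made up for the example and have nothing to do with how AnTuTu is actually built.

```python
import shutil
import subprocess
import time
from itertools import product

# Hypothetical defaults; substitute whatever compilers and flags you have.
COMPILERS = ["gcc", "clang"]
OPT_LEVELS = ["-O0", "-O2"]


def build_commands(source, compilers=COMPILERS, opt_levels=OPT_LEVELS):
    """Return {(compiler, flag): (argv, output_path)} for every combination."""
    commands = {}
    for cc, opt in product(compilers, opt_levels):
        out = f"./bench_{cc}_{opt.lstrip('-')}"
        commands[(cc, opt)] = ([cc, opt, source, "-o", out], out)
    return commands


def spread(timings):
    """Ratio of slowest to fastest run; a large value hints at a compiler artifact."""
    values = list(timings.values())
    return max(values) / min(values)


def run_matrix(source):
    """Compile and time each variant, skipping compilers that are not installed."""
    timings = {}
    for key, (argv, out) in build_commands(source).items():
        if shutil.which(argv[0]) is None:
            continue  # compiler not on PATH, nothing to measure
        subprocess.run(argv, check=True)
        start = time.perf_counter()
        subprocess.run([out], check=True)
        timings[key] = time.perf_counter() - start
    return timings
```

Something like spread(run_matrix("bench.c")) on a single desktop PC would then show how far apart the builds land. If one compiler is wildly faster than the others at the same optimization level, the test is probably measuring the compiler rather than the hardware.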
These artificial tests are merely rough indicators of performance. Getting overly excited about them is a tempest in a teacup.
Instead of an artificial memory test, why not do a real usage test, such as downloading a long movie or having a stock streamer running, while playing a YouTube video and holding a voice call at the same time? If all tested phones do equally well at that, then they are equal in performance, and no further benchmark is needed.