If Cox lowers the p-value from 0.33 to 0.02, then the trial is ipso facto not intelligently-designed and well-balanced.
OK, I think we are reaching the essence. If there existed a tool that incontrovertibly compensated for imbalance, there would be no need to run a 'well-balanced trial'. Cox Regression is, to a large extent, that tool. Your position is analogous to saying you must design a transmitter system with an S/N of twice Shannon's limit despite having Turbo Coding. I assert that it's passé. (Yes, I am definitely exaggerating. Apologies. Just trying to make it clear.)
A measure of this is exactly what I said before: look at the correlation between the size of the Cox Regression correction and the correctness of the corrected HR. I'll bet they are not highly correlated.
(Another measure is to see how much Cox Regression overcorrects: simulate a trial, add Cox Regression noise, perform Cox Regression. I think there is a good chance that, in general, Cox Regression produces a better HR than the original data before the Cox Regression noise was added.)
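Here is a rough sketch of how one might run the first check, with a simpler cousin of the second (adjusted versus unadjusted error) thrown in. Everything in it is my own assumption rather than anything from a real trial: exponential survival times, no censoring, a single standard-normal prognostic covariate `x`, a true conditional log HR of -0.3, and a bare-bones Newton-Raphson fit of the Cox partial likelihood written in NumPy. The names `cox_fit`, `simulate_trial` and `run_experiment` are made up for illustration.

```python
import numpy as np


def cox_fit(time, event, X, n_iter=25):
    """Fit Cox log hazard ratios by Newton-Raphson on the partial likelihood.

    Assumes continuous event times (no ties among events).
    """
    order = np.argsort(-time)  # descending time: risk set of row i is rows 0..i
    X = X[order]
    event = event[order]
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = np.exp(X @ beta)
        s0 = np.cumsum(w)
        s1 = np.cumsum(w[:, None] * X, axis=0)
        s2 = np.cumsum(w[:, None, None] * X[:, :, None] * X[:, None, :], axis=0)
        mean = s1 / s0[:, None]  # risk-set weighted covariate mean
        grad = (event[:, None] * (X - mean)).sum(axis=0)
        var = s2 / s0[:, None, None] - mean[:, :, None] * mean[:, None, :]
        info = (event[:, None, None] * var).sum(axis=0)  # observed information
        beta = beta + np.linalg.solve(info, grad)
    return beta


def simulate_trial(rng, n=200, log_hr_trt=-0.3, log_hr_x=1.0):
    """One randomized trial; chance imbalance in x between arms is the 'noise'."""
    trt = rng.integers(0, 2, size=n).astype(float)
    x = rng.normal(size=n)
    rate = np.exp(log_hr_trt * trt + log_hr_x * x)
    time = rng.exponential(1.0 / rate)
    event = np.ones(n)  # assumption: every patient is followed to the event
    return time, event, trt, x


def run_experiment(n_trials=300, seed=0, true_log_hr=-0.3):
    """Return (corr of correction size with adjusted error, adjusted error, unadjusted error)."""
    rng = np.random.default_rng(seed)
    correction, err_adj, err_unadj = [], [], []
    for _ in range(n_trials):
        time, event, trt, x = simulate_trial(rng, log_hr_trt=true_log_hr)
        b_unadj = cox_fit(time, event, trt[:, None])[0]
        b_adj = cox_fit(time, event, np.column_stack([trt, x]))[0]
        correction.append(abs(b_adj - b_unadj))
        err_adj.append(abs(b_adj - true_log_hr))
        err_unadj.append(abs(b_unadj - true_log_hr))
    corr = np.corrcoef(correction, err_adj)[0, 1]
    return corr, float(np.mean(err_adj)), float(np.mean(err_unadj))


if __name__ == "__main__":
    corr, adj, unadj = run_experiment()
    print(f"corr(|correction|, |adjusted error|) = {corr:.3f}")
    print(f"mean |error| adjusted = {adj:.3f}, unadjusted = {unadj:.3f}")
```

A caveat on reading the numbers: the error is measured against the conditional log HR used to generate the data, and an unadjusted Cox fit is attenuated relative to that even with perfect balance, so part of the unadjusted error is not due to imbalance at all. For anything beyond a toy check, an established fitter (for example the one in the `lifelines` package) would be a safer choice than this hand-rolled one.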