With 500 patients, 9902b will be much better balanced than the integrated 9901+02a dataset. Although Cox should give more precision than log-rank, the improvement in p-value might not be large.
To first order this is not true, at least in terms of the ratio of p-value improvement. People always talk about imbalance being smaller in a big trial, but if the efficacy in a big trial is low (which it would have to be in order to get p near 0.05), then that smaller imbalance is still just as big relative to the efficacy.
So for the purpose of simulation, it might be better to just do log-rank, since we have no data on the prognostic factors for each data point. What do you see as the power of the trial with HR=1.20 or HR=1.40?
With a TRUE HR of 1.2 and a straight log-rank test, the power is well under 50%, since a measured HR of 1.2 gives a straight log-rank p-value above 0.05 (and the measured HR comes in at or below the true value about half the time). As for the other calibration points, I'll have to give them to you later, once I can run the MC.
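In the meantime, here is a rough analytic cross-check, not the MC itself. It is a minimal sketch using Schoenfeld's normal approximation for the log-rank test, which depends on the number of deaths rather than the number of patients. The event counts (250, 360, 500) and the 1:1 allocation default are placeholders I picked for illustration, not figures from 9902b, and the function names `logrank_power` and `critical_hr` are made up for this post.

```python
from math import exp, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def logrank_power(true_hr, events, alloc=0.5, alpha=0.05):
    """Approximate power of a two-sided log-rank test (Schoenfeld).

    events: number of deaths observed (assumed, not taken from the trial).
    alloc:  fraction of patients in one arm (0.5 means 1:1, an assumption).
    The negligible chance of significance in the wrong direction is ignored.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    info = events * alloc * (1 - alloc)  # approximate information for log(HR)
    return _STD_NORMAL.cdf(sqrt(info) * abs(log(true_hr)) - z_crit)


def critical_hr(events, alloc=0.5, alpha=0.05):
    """Smallest measured HR (> 1) that reaches two-sided p = alpha."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return exp(z_crit / sqrt(events * alloc * (1 - alloc)))


if __name__ == "__main__":
    # Hypothetical event counts, chosen only to show the trend.
    for events in (250, 360, 500):
        print(
            f"events={events}: "
            f"critical HR={critical_hr(events):.3f}, "
            f"power at HR=1.2: {logrank_power(1.2, events):.2f}, "
            f"power at HR=1.4: {logrank_power(1.4, events):.2f}"
        )
```

Under this approximation the power is exactly 50% when the true HR equals the critical measured HR, so the "well under 50%" statement for HR=1.2 holds whenever the event count is low enough that `critical_hr` comes out above 1.2. An unequal allocation can be tried by passing a different `alloc`.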