> Identifying the best pseudo-label seems like it would be the limitation of the approach.

Yes, I think this says in a different way what I'm trying to express.

In a GAN, the Discriminator pegs the training to some chosen reality (assuming the "real" data set is truly real). With Challenger/Solver alone, there is no peg. The Solver could hallucinate consistently and still "win" the race, because consistency is what the training rewards.

Using GPT-4o as an arbiter of the Challenger/Solver training provides the reality peg (or rather, a peg biased toward GPT-4o's training set).
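
To make the "no peg" point concrete, here's a toy sketch. It assumes the pseudo-label is just the majority vote over the Solver's own samples, which is my guess at the setup and not something confirmed in this thread. The function names, the example question, and the hard-coded arbiter answer are all hypothetical stand-ins; no real model is called.

```python
from collections import Counter


def pseudo_label(answers):
    """Most common answer among the Solver's samples (assumed majority vote)."""
    return Counter(answers).most_common(1)[0][0]


def consistency_reward(answers):
    """Fraction of samples agreeing with the pseudo-label; no outside reference."""
    label = pseudo_label(answers)
    return sum(a == label for a in answers) / len(answers)


def arbiter_reward(answers, arbiter_answer):
    """Fraction of samples agreeing with an external arbiter's answer."""
    return sum(a == arbiter_answer for a in answers) / len(answers)


# Hypothetical question "17 * 3"; the Solver is confidently wrong every time.
solver_samples = ["41", "41", "41", "41"]
arbiter_answer = "51"  # stand-in for what an external arbiter model would return

print(consistency_reward(solver_samples))              # 1.0
print(arbiter_reward(solver_samples, arbiter_answer))  # 0.0
```

The consistently wrong Solver gets full marks from its own majority vote and nothing from the arbiter. The arbiter's answer is only as good as the arbiter, so the peg moves to wherever that model's training set puts it.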