OpenAI's frontier evals team is running out of tests hard enough to challenge its own models. Tejal Patwardhan, who leads that team, tells host Andrew Mayne that reasoning models like o1 broke existing benchmarks faster than new ones could be built, forcing her group to rethink what measurement even means at the frontier.
The interesting material is in the middle of this conversation, not the conclusion. Patwardhan explains the specific mechanics of benchmark saturation and gaming, how voice and vision models require entirely different eval frameworks from text models, and how OpenAI is now testing models against real unsolved science problems as a ceiling rather than a floor. The chapter on scientific benchmarks at 00:24:48 is worth your time on its own.
The stakes are practical, not philosophical. If evals fail, research direction fails with them. Patwardhan's team is the feedback loop that tells OpenAI whether a new model is actually better or just better at passing tests. That distinction is becoming harder to make, and she is candid about where the gaps are.