Benchmark-Driven Selection of AI: Evidence from DeepSeek-R1
Petr Spelda, Vit Stritecky

TL;DR
This paper shows that performance gains in reasoning language models can come from using impactful benchmarks as learning curricula, not only from test-time algorithmic improvements or model size. It demonstrates the effect on DeepSeek-R1 and argues that, as a result, novelty of test tasks becomes key for measuring generalization.
Contribution
It introduces the concept of benchmark-driven selection of AI, showing how impactful benchmarks can serve as curricula to improve reasoning models' generalization capabilities.
Findings
Impactful benchmarks can serve as curricula for training reasoning models.
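Better performance is not always caused by test-time algorithmic improvements or model size; it can also result from learning on impactful benchmarks.
The effect is shown on DeepSeek-R1 using the authors' sequential decision-making problem from Humanity's Last Exam.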
Abstract
Evaluation of reasoning language models gained importance after it was observed that they can combine their existing capabilities into novel traces of intermediate steps before task completion and that the traces can sometimes help them to generalize better than past models. As reasoning becomes the next scaling dimension of large language models, careful study of their capabilities in critical tasks is needed. We show that better performance is not always caused by test-time algorithmic improvements or model sizes but also by using impactful benchmarks as curricula for learning. We call this benchmark-driven selection of AI and show its effects on DeepSeek-R1 using our sequential decision-making problem from Humanity's Last Exam. Steering development of AI by impactful benchmarks trades evaluation for learning and makes novelty of test tasks key for measuring generalization…
