Open-World Evaluations for Measuring Frontier AI Capabilities

Sayash Kapoor; Peter Kirgis; Andrew Schwartz; Stephan Rabanser; J.J. Allaire; Rishi Bommasani; Harry Coppock; Magda Dubois; Gillian K Hadfield; Andrew B. Hall; Sara Hooker; Seth Lazar; Steve Newman; Dimitris Papailiopoulos; Shoshannah Tekofsky; Helen Toner; Cozmin Ududec; Arvind Narayanan

arXiv:2605.20520·cs.AI·May 21, 2026

TL;DR

Open-world evaluations complement benchmarks by assessing AI capabilities on long-horizon, messy, real-world tasks through small-sample qualitative analysis, which can provide early warning of emerging capabilities.

Contribution

The paper surveys recent open-world evaluations, discusses their strengths and limitations, and introduces CRUX, a project for conducting such evaluations regularly.

Findings

01. An AI agent developed and published a simple iOS app to the Apple App Store with minimal manual intervention.

02. Open-world evaluations can detect emerging AI capabilities earlier than traditional benchmarks.

03. The approach emphasizes qualitative, long-horizon assessment over automated, short-horizon benchmarks.

Abstract

Benchmark-based evaluation remains important for tracking frontier AI progress. But it can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons. We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. In this paper we survey recent open-world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with…
