Benchmarking at the Edge of Comprehension
Samuele Marro, Jialin Yu, Emanuele La Malfa, Oishi Deb, Jiawei Li, Yibo Yang, Ebey Abraham, Sunando Sengupta, Eric Sommerlade, Michael Wooldridge, Philip Torr

TL;DR
This paper introduces Critique-Resilient Benchmarking, an adversarial framework for evaluating large language models even when full human understanding of the tasks is infeasible, with the aim of keeping benchmark scores stable and meaningful.
Contribution
The paper proposes a novel adversarial benchmarking method that relies on critique-resilient correctness and localized human verification, addressing challenges posed by increasingly advanced models.
Findings
Scores are stable across models and correlate with external measures.
The approach proves effective in the mathematical domain when evaluated on eight frontier LLMs.
Benchmarking is reformulated as an adversarial game with human adjudication.
Abstract
As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a juncture: if frontier models keep improving, it will become increasingly hard for humans to generate discriminative tasks, provide accurate ground-truth answers, or evaluate complex solutions. If benchmarking becomes infeasible, our ability to measure any progress in AI is at stake. We refer to this scenario as the post-comprehension regime. In this work, we propose Critique-Resilient Benchmarking, an adversarial framework designed to compare models even when full human understanding is infeasible. Our technique relies on the notion of critique-resilient correctness: an answer is deemed correct if no adversary has convincingly proved otherwise. Unlike standard benchmarking, humans serve as bounded verifiers and focus on localized claims, which…
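The notion of critique-resilient correctness described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Critique` and `Answer` classes, the `is_critique_resilient` function, and the toy verifier are all hypothetical names introduced here, and the human verifier is modeled simply as a callable that says whether a single localized claim holds up.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Critique:
    """A localized objection raised by an adversary (hypothetical structure)."""
    critic: str  # name of the model raising the objection
    claim: str   # the single claim a bounded human verifier is asked to check


@dataclass
class Answer:
    """A model's answer together with the critiques raised against it."""
    author: str
    text: str
    critiques: List[Critique] = field(default_factory=list)


def is_critique_resilient(answer: Answer,
                          upholds: Callable[[Critique], bool]) -> bool:
    """Return True when no critique against the answer is upheld.

    `upholds` stands in for the human verifier: it inspects one localized
    claim at a time and never has to judge the full answer.
    """
    return not any(upholds(c) for c in answer.critiques)


# Toy data: the set of critiques a human verifier confirmed (assumed here).
confirmed = {"7 is odd, not even"}


def toy_verifier(critique: Critique) -> bool:
    return critique.claim in confirmed


answers = [
    Answer("model_a", "2 + 2 = 4", [Critique("model_b", "the sum is wrong")]),
    Answer("model_b", "7 is even", [Critique("model_a", "7 is odd, not even")]),
]

results = {a.author: is_critique_resilient(a, toy_verifier) for a in answers}
print(results)
```

In this sketch an answer that nobody challenges also counts as correct, which mirrors the wording "no adversary has convincingly proved otherwise". How the paper aggregates such outcomes into model scores is not specified in the text above, so no scoring rule is shown.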
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Topic Modeling
