Benchmarking at the Edge of Comprehension

Samuele Marro; Jialin Yu; Emanuele La Malfa; Oishi Deb; Jiawei Li; Yibo Yang; Ebey Abraham; Sunando Sengupta; Eric Sommerlade; Michael Wooldridge; Philip Torr

arXiv:2602.14307 · cs.AI · February 23, 2026

Open Access

TL;DR

This paper introduces Critique-Resilient Benchmarking, an adversarial framework that enables the evaluation of large language models' capabilities even when full human understanding of tasks becomes infeasible, ensuring stable and meaningful benchmarking.

Contribution

The paper proposes a novel adversarial benchmarking method that relies on critique-resilient correctness and localized human verification, addressing challenges posed by increasingly advanced models.

Findings

01. Scores are stable across models and correlate with external measures.

02. Effective in the mathematical domain with eight frontier LLMs.

03. Reformulates benchmarking as an adversarial game with human adjudication.

Abstract

As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a juncture: if frontier models keep improving, it will become increasingly hard for humans to generate discriminative tasks, provide accurate ground-truth answers, or evaluate complex solutions. If benchmarking becomes infeasible, our ability to measure any progress in AI is at stake. We refer to this scenario as the post-comprehension regime. In this work, we propose Critique-Resilient Benchmarking, an adversarial framework designed to compare models even when full human understanding is infeasible. Our technique relies on the notion of critique-resilient correctness: an answer is deemed correct if no adversary has convincingly proved otherwise. Unlike standard benchmarking, humans serve as bounded verifiers and focus on localized claims, which…
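
The notion of critique-resilient correctness described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Critique` structure, the `is_critique_resilient` function, and the `upheld` callback (standing in for a bounded human verifier who checks one localized claim at a time) are all hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Critique:
    """A localized claim that part of an answer is wrong (hypothetical structure)."""
    critic: str  # name of the adversary raising the critique
    claim: str   # the specific, checkable objection


def is_critique_resilient(
    critiques: Iterable[Critique],
    upheld: Callable[[Critique], bool],
) -> bool:
    """Return True when the verifier upholds none of the critiques.

    `upheld` is an assumed stand-in for the bounded verifier: it only
    judges one localized claim, never the full answer.
    """
    return not any(upheld(c) for c in critiques)


# Usage: two adversaries object; the verifier rejects both objections,
# so the answer counts as correct under this sketch.
critiques = [
    Critique("model_b", "Step 3 divides by zero when x = 0"),
    Critique("model_c", "The final bound is off by one"),
]
verdicts = {critiques[0].claim: False, critiques[1].claim: False}
print(is_critique_resilient(critiques, lambda c: verdicts[c.claim]))
```

Under this reading, an answer with no critiques at all is also treated as correct, which matches the "no adversary has convincingly proved otherwise" wording; how the actual framework scores models from such verdicts is not covered by this sketch.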

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Topic Modeling