QuickScope: Certifying Hard Questions in Dynamic LLM Benchmarks
Taylor Lundy, Narun K. Raman, Kevin Leyton-Brown

TL;DR
QuickScope is a new method that efficiently identifies challenging questions in dynamic LLM benchmarks, improving detection accuracy and reducing false positives.
Contribution
It adapts COUP, a recent Bayesian optimization algorithm, to practical LLM evaluation pipelines, enabling targeted, sample-efficient discovery of hard questions in dynamic benchmarks.
Findings
QuickScope outperforms standard baselines in discovering difficult questions.
It reduces false positives caused by noisy outcomes.
The method is flexible across various datasets and utility functions.
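To illustrate why noisy outcomes produce false positives, here is a minimal sketch that contrasts a naive "hard question" check with a confidence-bound check. It is not the paper's COUP-based procedure: it swaps in a plain one-sided Hoeffding bound, and the function names, the 0.5 accuracy threshold, and the 0.05 error level are all assumptions chosen for illustration.

```python
import math


def hoeffding_upper_bound(successes: int, trials: int, delta: float = 0.05) -> float:
    """One-sided Hoeffding upper bound on a question's true accuracy.

    The true accuracy exceeds the returned value with probability at most delta.
    """
    if trials == 0:
        return 1.0  # no evidence yet, so accuracy could be anything up to 1
    mean = successes / trials
    radius = math.sqrt(math.log(1.0 / delta) / (2.0 * trials))
    return min(1.0, mean + radius)


def naive_is_hard(successes: int, trials: int, threshold: float = 0.5) -> bool:
    """Flag a question as hard from the raw empirical accuracy alone."""
    return trials > 0 and successes / trials < threshold


def certified_is_hard(successes: int, trials: int,
                      threshold: float = 0.5, delta: float = 0.05) -> bool:
    """Flag a question as hard only if the upper bound is below the threshold."""
    return hoeffding_upper_bound(successes, trials, delta) < threshold


# Two failed attempts look "hard" to the naive check, but two samples are
# far too few to rule out an easy question that was unlucky.
print(naive_is_hard(0, 2), certified_is_hard(0, 2))
# With 40 attempts and only 2 successes, the evidence is strong enough.
print(naive_is_hard(2, 40), certified_is_hard(2, 40))
```

The confidence-bound check trades extra samples for fewer spurious flags; a sample-efficient method would additionally decide which questions deserve those extra samples.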
Abstract
LLM benchmarks are increasingly dynamic: instead of containing a fixed set of questions, they define templates and parameters that can generate an effectively unlimited number of question variants. This flexibility is valuable, but it makes evaluation expensive -- especially when the goal is not just determining an average score, but reliably identifying a model's weak spots. This paper introduces a new methodology for identifying hard questions in dynamic benchmarks. It leverages COUP, a recent Bayesian optimization algorithm (Graham, Velez & Leyton-Brown, 2026), after introducing several substantive modifications to make the algorithm suitable for practical LLM pipelines. We also wrap it in a tool that supports flexible choices of datasets and utility functions, enabling users to target the kinds of questions they care about (e.g., low-accuracy questions; questions that are unusually…
