In Case You Missed It: ARC 'Challenge' Is Not That Challenging
Łukasz Borchmann

TL;DR
This paper shows that the perceived difficulty of the ARC Challenge benchmark stems largely from an evaluation setup that prevents models from comparing answer choices, and proposes fairer methods that better reflect model reasoning abilities, in some cases (OpenBookQA) yielding superhuman results.
Contribution
It identifies evaluation biases in ARC and other benchmarks, demonstrating how fairer assessment methods significantly reduce perceived difficulty and improve the accuracy of model capability measurement.
Findings
Fairer evaluation methods dramatically reduce performance gaps on benchmarks such as SIQA.
Fairer evaluation schemes yield superhuman results on OpenBookQA.
Evaluation setup shapes perceived difficulty more than inherent task complexity does.
Abstract
ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily due to an evaluation setup that prevents direct comparison of answer choices rather than inherent complexity. Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged. We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g. on SIQA) and even yield superhuman results (OpenBookQA). In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.
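To make the contrast between the two evaluation setups concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's code: the prompt templates, the function names, and the `score(context, continuation)` callable (standing in for a model's log-likelihood of a continuation) are all hypothetical. In the "separate" setup each answer is scored in isolation, so the model never sees the competing choices; in the "joint" setup all options appear in one prompt and the model picks a letter.

```python
from typing import Callable, List

LETTERS = "ABCDEFGH"

# Hypothetical scorer type: returns a log-likelihood-like number for
# `continuation` given `context`. A real setup would query a language model.
Scorer = Callable[[str, str], float]


def separate_prompts(question: str, options: List[str]) -> List[str]:
    """One prompt per option; competing answers are never shown together."""
    return [f"Question: {question}\nAnswer: {opt}" for opt in options]


def joint_prompt(question: str, options: List[str]) -> str:
    """A single prompt that lists every option, so choices can be compared."""
    lines = [f"Question: {question}"]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer:")
    return "\n".join(lines)


def predict_separate(question: str, options: List[str], score: Scorer) -> int:
    """Score each option text on its own and return the best-scoring index."""
    context = f"Question: {question}\nAnswer:"
    scores = [score(context, " " + opt) for opt in options]
    return max(range(len(options)), key=scores.__getitem__)


def predict_joint(question: str, options: List[str], score: Scorer) -> int:
    """Show all options at once and score only the answer letters."""
    context = joint_prompt(question, options)
    scores = [score(context, " " + LETTERS[i]) for i in range(len(options))]
    return max(range(len(options)), key=scores.__getitem__)
```

With the same underlying model, the two functions can disagree on a question such as "Which of these is heaviest?", because only the joint prompt gives the model the information needed to compare the candidates.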
