Resurrecting saturated LLM benchmarks with adversarial encoding

Igor Ivanov; Dmitrii Volkov

arXiv:2502.06738·cs.LG·February 11, 2025

TL;DR

This paper shows that two modifications to benchmark questions, pairing them and adding more answer options, lower the scores of capable large language models and so restore headroom on saturated benchmarks.

Contribution

It introduces a method for restoring the difficulty of saturated benchmarks by adversarially modifying their questions, giving a more accurate assessment of LLMs.

Findings

01. Modified benchmarks show reduced model performance, indicating previous saturation.

02. The approach can effectively resurrect and extend the utility of older benchmarks.

03. Capable models' performance is more accurately reflected after modifications.

Abstract

Recent work showed that small changes in benchmark questions can reduce LLMs' reasoning and recall. We explore two such changes: pairing questions and adding more answer options, on three benchmarks: WMDP-bio, GPQA, and MMLU variants. We find that for more capable models, these predictably reduce performance, essentially heightening the performance ceiling of a benchmark and unsaturating it again. We suggest this approach can resurrect old benchmarks.
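
The two modifications named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `MCQ` container, the function names, and the choice to encode a pair as the cross product of both option lists are all assumptions made here for illustration. The sketch only shows why either change lowers the random-guess baseline and leaves more room below a perfect score.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MCQ:
    """A multiple-choice question (hypothetical container, not from the paper)."""
    question: str
    options: tuple[str, ...]
    answer_index: int


def add_options(item: MCQ, distractors: list[str], rng: random.Random) -> MCQ:
    """Add extra wrong options, then shuffle so the answer position varies."""
    correct = item.options[item.answer_index]
    # Skip distractors that duplicate an existing option to keep answers unique.
    extra = [d for d in distractors if d not in item.options]
    pool = list(item.options) + extra
    rng.shuffle(pool)
    return MCQ(item.question, tuple(pool), pool.index(correct))


def pair_questions(a: MCQ, b: MCQ) -> MCQ:
    """Merge two questions into one item (assumed cross-product encoding).

    Each option is one combination of an answer to `a` and an answer to `b`,
    so the merged item is correct only if both parts are correct.
    """
    options = tuple(f"{x} / {y}" for x in a.options for y in b.options)
    answer_index = a.answer_index * len(b.options) + b.answer_index
    return MCQ(f"Q1: {a.question}\nQ2: {b.question}", options, answer_index)


def chance_accuracy(item: MCQ) -> float:
    """Expected accuracy of uniform random guessing on this item."""
    return 1.0 / len(item.options)


# Toy items, invented for this example only.
q1 = MCQ("What is 2 + 2?", ("3", "4", "5", "6"), 1)
q2 = MCQ("What is the capital of France?", ("Paris", "Rome", "Oslo", "Bern"), 0)

paired = pair_questions(q1, q2)
harder = add_options(q1, ["7", "8"], random.Random(0))
```

Under these assumptions, pairing two four-option questions yields sixteen combined options, so the guessing baseline falls from 1/4 to 1/16, and adding two distractors to a four-option question lowers it to 1/6. How the paper itself formats paired prompts and scores them may differ.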

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Natural Language Processing Techniques · Machine Learning and Algorithms