The Evaluation Game: Beyond Static LLM Benchmarking
Paul Wang, Jade Garcia-Bourrée, Anne-Marie Kermarrec, Vincent Corruble

TL;DR
This paper introduces a game-theoretic framework to analyze the robustness of large language models against jailbreaks, emphasizing the importance of data augmentation and local generalization in fine-tuning.
Contribution
It formalizes the interaction between evaluators and trainers as a two-player game using group actions, providing new insights into adversarial robustness and evaluation dynamics.
Findings
When the trainer's generalization range is below a critical threshold, the evaluator maintains a constant miss ratio for linearly many rounds.
Fine-tuning induces local generalization, with refusal rates correlated with prompt distance.
The framework recasts benchmarks as orbits under group actions, challenging static evaluation methods.
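For readers unfamiliar with the terminology, the standard definitions of a group action and an orbit are recalled below. The symbols $G$, $X$, $x$, $n$, and $\theta$ are notation chosen here for illustration and are not necessarily the paper's. The last line shows one natural way to write a cyclic translation action on the circle, which is an assumption about how the paper's setting looks rather than a quotation from it.

```latex
% A group G acts on a set X via a map G x X -> X, (g, x) |-> g . x, such that
\[
  e \cdot x = x,
  \qquad
  (gh) \cdot x = g \cdot (h \cdot x)
  \quad \text{for all } g, h \in G,\ x \in X .
\]
% The orbit of a point x is the set of everything reachable from x:
\[
  G \cdot x = \{\, g \cdot x : g \in G \,\}.
\]
% Illustrative case: the cyclic group Z_n translating the circle R/Z.
\[
  k \cdot \theta = \theta + \tfrac{k}{n} \pmod 1,
  \qquad k \in \mathbb{Z}_n,\ \theta \in \mathbb{R}/\mathbb{Z}.
\]
```

Read this way, a benchmark built by augmenting one seed prompt is the orbit of that seed under the chosen group of transformations.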
Abstract
As jailbreaks, adversarially crafted inputs that bypass safety constraints, continue to be discovered in Large Language Models, practitioners increasingly rely on fine-tuning as a defensive strategy. Yet the theoretical foundations underlying this robustness fine-tuning remain underexplored. We introduce a game-theoretic framework in which the interaction between an evaluator (auditing the model for jailbreaks) and a trainer is formalized as a two-player game. A key feature of our approach is the use of group actions, a mathematical structure that captures symmetries and transformations, to formally represent data augmentation. The simplest non-trivial instance is the circle with cyclic translation groups, where we exhibit various regimes depending on the trainer's generalization range. Below a critical threshold, the evaluator maintains a constant miss ratio for linearly many rounds,…
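The circle setting can be made concrete with a toy simulation. The sketch below is not the paper's model: it discretizes the circle as the integers modulo `n`, lets a hypothetical evaluator probe the orbit of a seed point under translation by `step`, and lets a hypothetical trainer patch every point within `radius` of a successful probe. The names `orbit`, `play`, `step`, and `radius`, and the definition of a "miss" as a probe that lands on an already patched point, are all assumptions made for illustration.

```python
from math import gcd


def circular_distance(a, b, n):
    """Shortest distance between points a and b on a circle of n points."""
    d = (a - b) % n
    return min(d, n - d)


def orbit(seed, step, n):
    """Orbit of `seed` under the cyclic subgroup of Z_n generated by `step`."""
    size = n // gcd(step, n)
    return [(seed + k * step) % n for k in range(size)]


def play(n, step, radius, rounds, seed=0):
    """Toy evaluator/trainer loop (illustrative assumption, not the paper's game).

    Every point starts as a jailbreak. Each round the evaluator probes the
    next point of the orbit; on a hit, the trainer patches all points within
    `radius` of the probe. Returns the fraction of probes that missed.
    """
    unsafe = set(range(n))
    probes = orbit(seed, step, n)
    misses = 0
    for t in range(rounds):
        p = probes[t % len(probes)]
        if p in unsafe:
            # Hit: the trainer generalizes locally around the probe.
            patched = {q for q in range(n) if circular_distance(p, q, n) <= radius}
            unsafe -= patched
        else:
            # Miss: the probe landed on a point that was already patched.
            misses += 1
    return misses / rounds
```

Calling `play(12, 3, 0, 4)` models a trainer with no generalization, so every probe in the orbit still finds a jailbreak, while `play(12, 3, 3, 4)` models a trainer whose patches cover neighboring probes. Varying `radius` relative to `step` is a rough way to explore how the trainer's generalization range changes the evaluator's miss ratio in this toy setting.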
