Detecting Proxy Gaming in RL and LLM Alignment via Evaluator Stress Tests
Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

TL;DR
This paper introduces the Evaluator Stress Test (EST), an invariance-based framework that detects proxy gaming in AI systems by isolating exploitable evaluator sensitivities. The authors validate it in both reinforcement learning and LLM alignment, and pair it with closed-loop mitigation.
Contribution
The paper presents EST, an invariance-based method for detecting proxy gaming that applies to both RL and LLM alignment. It is validated on 2,156 expert-annotated RL episodes and 1,200 human-annotated LLM instances, and is accompanied by mitigation approaches.
Findings
In RL (15 environments, 5 algorithms), EST achieves 78.4% precision and 81.7% recall.
In LLM alignment (4 tasks, 2 model scales, 2 training methods, 2 judges), EST achieves 74.2% precision and 78.6% recall.
Closed-loop mitigation significantly reduces hacking and improves human win-rate.
Abstract
Proxy optimization, where AI systems exploit evaluator weaknesses rather than improve intended objectives, threatens both reinforcement learning (reward hacking) and LLM alignment (evaluator gaming). We introduce the Evaluator Stress Test (EST), an invariance-based framework that detects proxy gaming by separating exploitable sensitivity (e.g., formatting artifacts, physics bugs) from content-driven improvements using controlled perturbations with semantic validity audits. We validate EST across both domains. In RL, across 15 environments and 5 algorithms (2,156 expert-annotated episodes), EST achieves 78.4% precision and 81.7% recall. In LLM alignment, across 4 tasks, 2 model scales, 2 training methods, and 2 judges (1,200 human-annotated instances), EST achieves 74.2% precision and 78.6% recall, with early warning signals that precede quality decline. Cross-domain analysis shows that…
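The invariance idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy judge, the two perturbations, and the 0.1 threshold are all hypothetical, and the sketch simply assumes each perturbation has already passed a semantic validity audit (that is, it does not change the meaning of the output). Under those assumptions, a large score change under meaning-preserving edits suggests the evaluator is rewarding something other than content.

```python
from typing import Callable, List

Evaluator = Callable[[str], float]      # maps an output to a proxy score
Perturbation = Callable[[str], str]     # assumed to preserve meaning


def exploitable_sensitivity(
    evaluator: Evaluator, output: str, perturbations: List[Perturbation]
) -> float:
    """Mean absolute score change under meaning-preserving perturbations."""
    if not perturbations:
        raise ValueError("at least one perturbation is required")
    base = evaluator(output)
    deltas = [abs(evaluator(p(output)) - base) for p in perturbations]
    return sum(deltas) / len(deltas)


def flag_proxy_gaming(
    evaluator: Evaluator,
    output: str,
    perturbations: List[Perturbation],
    threshold: float = 0.1,  # hypothetical cutoff, not taken from the paper
) -> bool:
    """Flag an output whose score depends on content-irrelevant features."""
    return exploitable_sensitivity(evaluator, output, perturbations) > threshold


# --- Toy setup (all hypothetical) -------------------------------------------

def toy_judge(text: str) -> float:
    """A deliberately flawed judge: correct content plus a bonus for '!'."""
    content = 1.0 if "paris" in text.lower() else 0.0
    formatting_bonus = 0.2 * min(text.count("!"), 3)
    return content + formatting_bonus


def strip_emphasis(text: str) -> str:
    """Replace exclamation marks with periods; the meaning is unchanged."""
    return text.replace("!", ".")


def collapse_whitespace(text: str) -> str:
    """Normalize runs of whitespace to single spaces."""
    return " ".join(text.split())


PERTURBATIONS: List[Perturbation] = [strip_emphasis, collapse_whitespace]
```

With this toy judge, a plain answer such as "The capital of France is Paris." keeps the same score under both perturbations and is not flagged, while "Paris!!! Definitely Paris!!!" loses its formatting bonus when the emphasis is stripped and is flagged. A real detector would need audited perturbations and a calibrated threshold, neither of which is specified in the text above.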
Taxonomy
Topics: Topic Modeling · Software Engineering Research · Artificial Intelligence in Games
Methods: Softmax · A3C · Q-Learning · Convolution · Dense Connections · Deep Q-Network
