Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation

Muhammad Khalifa; Lajanugen Logeswaran; Jaekyeom Kim; Sungryull Sohn; Yunxiang Zhang; Moontae Lee; Hao Peng; Lu Wang; Honglak Lee

arXiv:2601.14691 · cs.AI · January 23, 2026

Open Access

TL;DR

Large language models used as judges for agent performance are highly vulnerable to manipulation of agent reasoning traces, which can significantly inflate false positive rates and undermine evaluation reliability.

Contribution

This paper demonstrates that LLM-based evaluation methods are susceptible to reasoning manipulation and argues that verification mechanisms are needed to ensure assessment integrity.

Findings

01. Manipulated reasoning can inflate false positive rates by up to 90%.

02. Content-based reasoning manipulations are more effective than style-based ones.

03. Prompting techniques and increased compute reduce but do not eliminate manipulation susceptibility.

Abstract

Large language models (LLMs) are increasingly used as judges to evaluate agent performance, particularly in non-verifiable settings where judgments rely on agent trajectories including chain-of-thought (CoT) reasoning. This paradigm implicitly assumes that the agent's CoT faithfully reflects both its internal reasoning and the underlying environment state. We show this assumption is brittle: LLM judges are highly susceptible to manipulation of agent reasoning traces. By systematically rewriting agent CoTs while holding actions and observations fixed, we demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks. We study manipulation strategies spanning style-based approaches that alter only the presentation of reasoning and content-based approaches that fabricate signals of…
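The measurement described in the abstract (rewrite only the CoT, keep actions and observations fixed, then compare the judge's false positive rate) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Verdict` record, the `false_positive_rate` helper, and the toy verdicts are hypothetical and are not the paper's code or data. The sketch also does not settle whether the reported "up to 90%" is a relative or an absolute change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One judged trajectory (hypothetical record, not from the paper)."""
    task_succeeded: bool       # ground-truth outcome of the trajectory
    judge_said_success: bool   # what the LLM judge concluded


def false_positive_rate(verdicts):
    """Fraction of truly failed trajectories that the judge marked as successful."""
    failed = [v for v in verdicts if not v.task_succeeded]
    if not failed:
        raise ValueError("need at least one failed trajectory to compute an FPR")
    return sum(v.judge_said_success for v in failed) / len(failed)


# Toy verdicts for the same five trajectories (made-up values).
# Only the CoT shown to the judge differs between the two lists;
# the ground truth is unchanged because actions and observations are fixed.
original_cot = [
    Verdict(False, False),
    Verdict(False, False),
    Verdict(False, True),
    Verdict(False, False),
    Verdict(True, True),
]
manipulated_cot = [
    Verdict(False, True),
    Verdict(False, True),
    Verdict(False, True),
    Verdict(False, False),
    Verdict(True, True),
]

fpr_before = false_positive_rate(original_cot)
fpr_after = false_positive_rate(manipulated_cot)
inflation = fpr_after - fpr_before  # absolute change in FPR on the toy data

print(f"FPR before: {fpr_before:.2f}, after: {fpr_after:.2f}, change: {inflation:+.2f}")
```

Restricting the denominator to truly failed trajectories matters: a judge fooled by a persuasive CoT shows up as more failures being scored as successes, which overall accuracy on a mostly successful set could hide.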

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Ethics and Social Impacts of AI