Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation