Judge Circuits

Nils Feldhus; Tanja Baeumel; Elena Golimblevskaia; Qianli Wang; Van Bach Nguyen; Aaron Louis Eidt; Christopher Ebert; Wojciech Samek; Jing Yang; Vera Schmitt; Sebastian Möller; Simon Ostermann

arXiv:2605.16023·cs.CL·May 18, 2026

TL;DR

This paper investigates why an LLM judge's scores change with its output format, identifying a shared internal sub-graph that carries out evaluation and showing how formatting alters the judgment signal.

Contribution

It uses Position-aware Edge Attribution Patching (PEAP) to causally analyze the internal mechanism, identifying a shared Latent Evaluator sub-graph and decoupling abstract judgment from output formatting.

Findings

01

Judgments across structured understanding and open-ended preference tasks share a sparse, generalized sub-graph in the mid-to-late MLPs.

02

Zero-ablating this sub-graph collapses judgment while preserving world knowledge in architecturally modular models.

03

Format-specific terminal branches cause format-induced judgment inconsistencies.

Abstract

LLM-as-a-judge has become the dominant paradigm for grading model outputs at scale, yet the same model assigns systematically different scores when its output format changes (e.g., a 1-5 rating vs. a True/False label). Existing diagnoses of these format-induced inconsistencies stop at the input-output level. Using Position-aware Edge Attribution Patching (PEAP), we causally investigate the internal mechanism in Gemma-3, Qwen2.5, and Llama-3. We find that judgments across structured understanding and open-ended preference tasks share a sparse, generalized Latent Evaluator sub-graph in the mid-to-late multi-layer perceptrons (MLPs); zero-ablating it collapses judgment while preserving world knowledge in architecturally modular models. By structurally decoupling abstract judging from output formatting, we provide a mechanistic account of format-induced inconsistency on the open-weight…
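To make the zero-ablation idea in the abstract concrete, here is a minimal toy sketch in plain Python. It is not the paper's code and does not implement PEAP: the two-layer network, the weights `W_IN` and `W_OUT`, and the labelling of one output as "judgment" and the other as "knowledge" are all hypothetical, chosen only to show what it means to zero a set of hidden units and observe that one output collapses while another is unaffected.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def mlp_forward(x, w_in, w_out, ablate=()):
    """Toy MLP forward pass; hidden units whose index is in `ablate` are zeroed."""
    hidden = relu(matvec(w_in, x))
    hidden = [0.0 if i in ablate else h for i, h in enumerate(hidden)]
    return matvec(w_out, hidden)


# Hypothetical weights: hidden units 0 and 1 feed output 0 ("judgment"),
# hidden unit 2 feeds output 1 ("knowledge").
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_OUT = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

x = [2.0, 3.0]
clean = mlp_forward(x, W_IN, W_OUT)                    # no intervention
ablated = mlp_forward(x, W_IN, W_OUT, ablate={0, 1})   # zero the toy "evaluator" units

print("clean:  ", clean)
print("ablated:", ablated)
```

In this toy setup the ablated run drives the first output to zero and leaves the second unchanged, which mirrors, in miniature only, the kind of selective collapse the abstract describes for the Latent Evaluator sub-graph.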
