Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry
Zhuochun Li, Yong Zhang, Ming Li, Yuelyu Ji, Yiming Zeng, Ning Cheng, Yun Zhu, Yanmeng Wang, Shaojun Wang, Jing Xiao, Daqing He

TL;DR
This paper proposes an evaluation paradigm that probes the internal representations of small language models, and presents it as more efficient, reliable, and interpretable than traditional prompting-based evaluation with large models.
Contribution
It introduces the Semantic Capacity Asymmetry Hypothesis and the INSPECTOR framework, shifting evaluation from output generation to internal representation probing.
Findings
INSPECTOR outperforms prompting-based small LMs on reasoning benchmarks.
Small models encode rich evaluative signals in hidden states despite weak generative ability.
Representation-as-a-Judge offers a scalable, interpretable alternative to large LLMs for evaluation.
Abstract
Large language models (LLMs) are widely used as reference-free evaluators via prompting, but this "LLM-as-a-Judge" paradigm is costly, opaque, and sensitive to prompt design. In this work, we investigate whether smaller models can serve as efficient evaluators by leveraging internal representations instead of surface generation. We uncover a consistent empirical pattern: small LMs, despite weak generative ability, encode rich evaluative signals in their hidden states. This motivates us to propose the Semantic Capacity Asymmetry Hypothesis: evaluation requires significantly less semantic capacity than generation and can be grounded in intermediate representations, suggesting that evaluation does not necessarily need to rely on large-scale generative models but can instead leverage latent features from smaller ones. Our findings motivate a paradigm shift from LLM-as-a-Judge to…
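To make the idea of reading evaluative signals from hidden states concrete, here is a minimal sketch of a linear probe. It is not the INSPECTOR implementation: the vectors are synthetic stand-ins for pooled hidden states, the binary label stands in for a "good"/"bad" judgment, and every function name (`make_synthetic_states`, `train_probe`, `predict`, `accuracy`) is hypothetical. In practice the feature vectors would be extracted from a small LM rather than sampled.

```python
import math
import random


def make_synthetic_states(n, dim, seed=0):
    """Assumption: synthetic vectors stand in for pooled hidden states.

    Each vector is Gaussian noise shifted along one random direction,
    with the sign of the shift given by a binary quality label.
    """
    rng = random.Random(seed)
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        vector = [rng.gauss(0.0, 1.0) + sign * d for d in direction]
        data.append((vector, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    # Linear probe: a single weight vector plus a bias.
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_probe(data, epochs=30, lr=0.05):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = sigmoid(score(w, b, x)) - y  # d(log-loss)/d(logit)
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def predict(w, b, x):
    return 1 if score(w, b, x) > 0 else 0


def accuracy(w, b, data):
    return sum(predict(w, b, x) == y for x, y in data) / len(data)


if __name__ == "__main__":
    states = make_synthetic_states(300, 16, seed=0)
    train, held_out = states[:200], states[200:]
    w, b = train_probe(train)
    print("held-out probe accuracy:", accuracy(w, b, held_out))
```

The probe has only one weight per feature dimension, so any accuracy it reaches on held-out data reflects information already present in the input vectors rather than capacity added by the classifier. That is the general reasoning behind probing; how the paper constructs its features and labels is not shown here.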
Peer Reviews
Decision: ICLR 2026 Poster
**Novel idea for LLM-as-a-Judge.** The concept of semantic capacity asymmetry is creative and appealing. It provides a fresh lens for understanding evaluation-relevant signals in internal representations, suggesting that weak evaluation performance in small LMs may arise from surface-level generation limitations rather than an inherent lack of semantic understanding. **Effective probing design.** The paper designs probing mechanisms to capture evaluation-related signals from model representatio
**1. Missing intuition behind the probing design.** The paper introduces probing experiments to support the idea of semantic capacity asymmetry, but the intuition for the probing setup is underexplained. It is unclear what specific representational property the probes aim to capture or why this design shows semantic capacity. **2. Limited dataset coverage.** The evaluation focuses on three reasoning benchmarks (GSM8K, MATH, and GPQA), all within the mathematics and science domain. Other tasks ty
1. The paper reframes model evaluation as Representation-as-a-Judge, shifting from prompt-based evaluation to representation-based probing. 2. By leveraging internal representations from small open-source LMs instead of large proprietary models, INSPECTOR offers a lightweight and interpretable alternative that significantly reduces computational cost. 3. The framework achieves high predictive accuracy on reasoning benchmarks (e.g., GSM8K, MATH, GPQA), demonstrating that small models (1.7B) can a
1. Experiments focus primarily on reasoning benchmarks; it is unclear whether the method generalizes to other domains such as dialogue, summarization, or open-ended generation. 2. The quality of the judge (small LM) may depend on how the evaluation criteria are established.
1. The Representation-as-a-Judge framework is technically sound and could serve as an alternative to the prevalent “LLM-as-a-Judge” approach. 2. The approach enables efficient evaluation using smaller, open-source models instead of proprietary LLMs. 3. The empirical results are strong, showing significant gains over some prompting and fine-tuning baselines.
1. A limitation is that the method still requires a powerful LLM in the loop to obtain initial evaluation scores (for training data). This paper assumes the LLM’s scores are gold-standard. It would strengthen the work to either validate against human ratings or discuss the implications of this dependency. 2. The probing classifiers achieve relatively low accuracy on fine-grained multiclass (1–5) predictions. This might limit the method’s use if one requires precise scoring, and also indicates s
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Computational and Text Analysis Methods
