Automated Concept Discovery for LLM-as-a-Judge Preference Analysis
James Wedgwood, Chhavi Yadav, Virginia Smith

TL;DR
This paper introduces an automated method for discovering the interpretable concepts that drive LLM preference judgments. It surfaces biases and trends in LLM evaluation behavior without relying on a predefined list of bias categories.
Contribution
The study compares several embedding-level concept extraction methods for analyzing LLM judgments and finds that sparse autoencoder-based approaches recover substantially more interpretable preference features than the alternatives, while remaining competitive at predicting LLM decisions.
Findings
LLM judges prefer refusals of sensitive requests more than humans do.
LLM judgments show biases toward concreteness and empathy.
LLM judgments show a bias against legal guidance that involves active steps.
Abstract
Large Language Models (LLMs) are increasingly used as scalable evaluators of model outputs, but their preference judgments exhibit systematic biases and can diverge from human evaluations. Prior work on LLM-as-a-judge has largely focused on a small, predefined set of hypothesized biases, leaving open the problem of automatically discovering unknown drivers of LLM preferences. We address this gap by studying several embedding-level concept extraction methods for analyzing LLM judge behavior. We compare these methods in terms of interpretability and predictiveness, finding that sparse autoencoder-based approaches recover substantially more interpretable preference features than alternatives while remaining competitive in predicting LLM decisions. Using over 27k paired responses from multiple human preference datasets and judgments from three LLMs, we analyze LLM judgments and compare them…
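To make the sparse autoencoder idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes each response pair is represented by a single vector (for example, the embedding of one response minus the embedding of the other), that the autoencoder is a single ReLU layer with an L1 sparsity penalty, and that plain gradient descent is used for training. The function names, hyperparameters, and toy data are all hypothetical; the abstract does not specify the architecture or training setup.

```python
import numpy as np

def train_sparse_autoencoder(X, n_features=8, l1=1e-3, lr=0.05, steps=300, seed=0):
    """Train a one-layer ReLU sparse autoencoder with full-batch gradient descent.

    X: (n_pairs, dim) array, one row per response pair (assumed representation).
    Returns (W_enc, b_enc, W_dec, losses).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, n_features))
    b_enc = np.zeros(n_features)
    W_dec = rng.normal(scale=0.1, size=(n_features, d))
    losses = []
    for _ in range(steps):
        # Forward pass: non-negative codes, then linear reconstruction.
        pre = X @ W_enc + b_enc
        Z = np.maximum(pre, 0.0)
        err = Z @ W_dec - X
        # Mean squared reconstruction error plus L1 penalty on the codes.
        losses.append((err ** 2).sum() / n + l1 * np.abs(Z).sum() / n)
        # Backward pass (manual gradients of the loss above).
        d_xhat = 2.0 * err / n
        d_W_dec = Z.T @ d_xhat
        d_Z = d_xhat @ W_dec.T + l1 * (Z > 0) / n
        d_pre = d_Z * (pre > 0)
        W_enc -= lr * (X.T @ d_pre)
        b_enc -= lr * d_pre.sum(axis=0)
        W_dec -= lr * d_W_dec
    return W_enc, b_enc, W_dec, losses

def encode(X, W_enc, b_enc):
    """Map pair vectors to sparse, non-negative feature activations."""
    return np.maximum(X @ W_enc + b_enc, 0.0)

# Toy stand-in for real pair embeddings: sparse combinations of 4 hidden
# directions in a 16-dimensional space, normalized to unit length.
rng = np.random.default_rng(0)
true_codes = rng.random((200, 4)) * (rng.random((200, 4)) < 0.3)
dictionary = rng.normal(size=(4, 16))
X = true_codes @ dictionary + 0.01 * rng.normal(size=(200, 16))
X /= np.linalg.norm(X, axis=1, keepdims=True)

W_enc, b_enc, W_dec, losses = train_sparse_autoencoder(X)
Z = encode(X, W_enc, b_enc)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}, codes shape: {Z.shape}")
```

In a pipeline of this kind, the columns of `Z` would be the candidate concepts: each could be inspected by reading the pairs that activate it most strongly, and the activations could feed a simple linear model that predicts which response the judge chose.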
Taxonomy
Topics: Artificial Intelligence in Law · Topic Modeling · Computational and Text Analysis Methods
