Automated Concept Discovery for LLM-as-a-Judge Preference Analysis

James Wedgwood; Chhavi Yadav; Virginia Smith

arXiv:2603.03319 · cs.CL · March 5, 2026

Open Access

TL;DR

This paper introduces an automated method for discovering the interpretable concepts that drive LLM preference judgments. It surfaces biases and trends in LLM evaluation behavior without relying on a predefined set of bias categories.

Contribution

The study presents a sparse autoencoder-based approach for extracting interpretable preference features from LLM judgments. Compared with alternative embedding-level concept extraction methods, it recovers substantially more interpretable features while remaining competitive in predicting LLM decisions.

Findings

01

LLM judges favor refusals of sensitive requests more than humans do.

02

LLM judgments show biases toward concreteness and empathy.

03

LLM judgments show a bias against legal guidance that involves active steps.

Abstract

Large Language Models (LLMs) are increasingly used as scalable evaluators of model outputs, but their preference judgments exhibit systematic biases and can diverge from human evaluations. Prior work on LLM-as-a-judge has largely focused on a small, predefined set of hypothesized biases, leaving open the problem of automatically discovering unknown drivers of LLM preferences. We address this gap by studying several embedding-level concept extraction methods for analyzing LLM judge behavior. We compare these methods in terms of interpretability and predictiveness, finding that sparse autoencoder-based approaches recover substantially more interpretable preference features than alternatives while remaining competitive in predicting LLM decisions. Using over 27k paired responses from multiple human preference datasets and judgments from three LLMs, we analyze LLM judgments and compare them…
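The abstract does not spell out the architecture, so the following is only a minimal sketch of what a sparse autoencoder over paired-response embeddings could look like. It assumes that each pair is represented by the difference between the two response embeddings and that the autoencoder is a single-layer ReLU model with an L1 sparsity penalty. The function names (`init_sae`, `train_step`, and so on), the dimensions, and the random stand-in embeddings are all hypothetical and are not taken from the paper.

```python
import numpy as np

def init_sae(d_in, d_hidden, seed=0):
    """Create parameters for a one-layer sparse autoencoder (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    w_enc = rng.normal(0.0, 0.1, size=(d_in, d_hidden))
    return {
        "W_enc": w_enc,
        "b_enc": np.zeros(d_hidden),
        "W_dec": w_enc.T.copy(),  # decoder starts as the transpose of the encoder
        "b_dec": np.zeros(d_in),
    }

def encode(params, x):
    """ReLU encoder: each hidden unit is a candidate 'concept' activation."""
    return np.maximum(0.0, x @ params["W_enc"] + params["b_enc"])

def decode(params, z):
    """Linear decoder back to embedding space."""
    return z @ params["W_dec"] + params["b_dec"]

def sae_loss(params, x, l1):
    """Mean squared reconstruction error plus an L1 penalty on activations."""
    z = encode(params, x)
    err = decode(params, z) - x
    # z is non-negative after ReLU, so its sum equals its L1 norm.
    return float(np.mean(np.sum(err**2, axis=1)) + l1 * np.mean(np.sum(z, axis=1)))

def train_step(params, x, lr, l1):
    """One full-batch gradient descent step with hand-derived gradients."""
    n = x.shape[0]
    pre = x @ params["W_enc"] + params["b_enc"]
    z = np.maximum(0.0, pre)
    err = z @ params["W_dec"] + params["b_dec"] - x
    g_xhat = 2.0 * err / n                     # d loss / d reconstruction
    g_z = g_xhat @ params["W_dec"].T + l1 / n  # reconstruction + sparsity terms
    g_pre = g_z * (pre > 0)                    # ReLU passes gradient only where active
    grads = {
        "W_dec": z.T @ g_xhat,
        "b_dec": g_xhat.sum(axis=0),
        "W_enc": x.T @ g_pre,
        "b_enc": g_pre.sum(axis=0),
    }
    for name, grad in grads.items():
        params[name] -= lr * grad

# Random stand-ins for embeddings of response A and response B in each pair.
rng = np.random.default_rng(1)
emb_a = rng.normal(size=(64, 8))
emb_b = rng.normal(size=(64, 8))
diffs = emb_a - emb_b  # assumed input: one difference vector per pair

params = init_sae(d_in=8, d_hidden=16)
initial_loss = sae_loss(params, diffs, l1=1e-3)
for _ in range(200):
    train_step(params, diffs, lr=0.01, l1=1e-3)
final_loss = sae_loss(params, diffs, l1=1e-3)
codes = encode(params, diffs)  # sparse, non-negative feature activations per pair
```

In a setup like this, the columns of `codes` would be the candidate features to inspect for interpretability, and a simple classifier fit on `codes` could be used to check how well they predict a judge's choice. Both of those downstream steps are assumptions about how such features might be used, not a description of the paper's pipeline.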

Taxonomy

Topics: Artificial Intelligence in Law · Topic Modeling · Computational and Text Analysis Methods