Can LLM be a Personalized Judge?
Yijiang River Dong, Tiancheng Hu, Nigel Collier

TL;DR
This paper critically examines the reliability of using LLMs as personalized judges for user preferences, introduces uncertainty estimation to improve judgment accuracy, and demonstrates promising results comparable to human evaluation.
Contribution
It reveals limitations of current LLM-as-a-Judge methods and proposes a certainty-aware approach that significantly enhances judgment reliability and agreement with human ground truth.
Findings
Low agreement of LLM-as-a-Judge with human ground truth
Verbal uncertainty estimation improves judgment accuracy
Achieves over 80% agreement on high-certainty samples
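The last finding depends on restricting evaluation to judgments the LLM marks as high-certainty. As a rough illustration only, the sketch below computes agreement with human ground truth, both overall and on a certainty-filtered subset. The `Judgment` record, the 1-5 certainty scale, the threshold, and the toy data are all assumptions made for this example; they are not taken from the paper and do not reproduce its results.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    predicted: str  # response the LLM judge says the user prefers, e.g. "A" or "B"
    actual: str     # human ground-truth preference
    certainty: int  # verbalized certainty from the judge (hypothetical 1-5 scale)


def agreement(judgments):
    """Fraction of judgments that match the human label, or None if the list is empty."""
    if not judgments:
        return None
    matches = sum(j.predicted == j.actual for j in judgments)
    return matches / len(judgments)


def agreement_above(judgments, min_certainty):
    """Agreement on judgments with certainty >= min_certainty, plus the fraction kept."""
    kept = [j for j in judgments if j.certainty >= min_certainty]
    coverage = len(kept) / len(judgments) if judgments else 0.0
    return agreement(kept), coverage


# Made-up toy data, for illustration only.
samples = [
    Judgment("A", "A", 5),
    Judgment("B", "B", 4),
    Judgment("A", "B", 2),
    Judgment("B", "A", 1),
    Judgment("A", "A", 3),
    Judgment("B", "A", 4),
]

overall = agreement(samples)
high_acc, high_cov = agreement_above(samples, min_certainty=4)
print(f"overall agreement: {overall:.2f}")
print(f"high-certainty agreement: {high_acc:.2f} on {high_cov:.0%} of samples")
```

Reporting coverage alongside agreement matters here: a stricter certainty threshold may raise agreement while leaving fewer samples to evaluate.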
Abstract
Ensuring that large language models (LLMs) reflect diverse user values and preferences is crucial as their user bases expand globally. It is therefore encouraging to see the growing interest in LLM personalization within the research community. However, current works often rely on the LLM-as-a-Judge approach for evaluation without thoroughly examining its validity. In this paper, we investigate the reliability of LLM-as-a-Personalized-Judge, asking LLMs to judge user preferences based on personas. Our findings suggest that directly applying LLM-as-a-Personalized-Judge is less reliable than previously assumed, showing low and inconsistent agreement with human ground truth. The personas typically used are often overly simplistic, resulting in low predictive power. To address these issues, we introduce verbal uncertainty estimation into the LLM-as-a-Personalized-Judge pipeline, allowing…
Taxonomy
Topics: Legal Education and Practice Innovations · Legal Systems and Judicial Processes · Judicial and Constitutional Studies
