Personalized Benchmarking: Evaluating LLMs by Individual Preferences
Cristina Garbacea, Heran Wang, Chenhao Tan

TL;DR
This paper advocates for personalized benchmarking of LLMs, demonstrating significant divergence from aggregate rankings and proposing methods to predict individual user preferences based on query characteristics.
Contribution
The paper introduces personalized LLM benchmarks built on per-user Elo ratings and Bradley-Terry coefficients, shows substantial heterogeneity in user preferences, and proposes predicting user-specific rankings from query features.
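To make the per-user rating idea concrete, here is a minimal sketch of how Elo ratings could be computed separately for each user from pairwise votes. The vote format `(user_id, model_a, model_b, winner)`, the K-factor of 32, and the starting rating of 1000 are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict


def elo_ratings(battles, k=32.0, base=1000.0):
    """Sequential Elo over (model_a, model_b, winner) tuples.

    winner is "a", "b", or "tie". k and base are assumed defaults,
    not values reported by the paper.
    """
    ratings = defaultdict(lambda: base)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the logistic Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        delta = k * (score_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta  # zero-sum update
    return dict(ratings)


def personalized_elo(votes, **kwargs):
    """Group votes by user and fit one Elo table per user.

    votes: iterable of (user_id, model_a, model_b, winner); this layout
    is hypothetical and only mirrors the shape of pairwise arena data.
    """
    by_user = defaultdict(list)
    for user_id, model_a, model_b, winner in votes:
        by_user[user_id].append((model_a, model_b, winner))
    return {user: elo_ratings(bs, **kwargs) for user, bs in by_user.items()}
```

Sequential Elo depends on the order in which votes arrive, so a user with few votes can end up with noticeably different ratings under a different ordering. Averaging over shuffled orderings is one common way to reduce that sensitivity.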
Findings
Individual users' LLM rankings diverge substantially from aggregate rankings.
User preferences are influenced by topical interests and communication styles.
A feature space based on topic and style can predict user-specific model rankings.
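One way to quantify how far a single user's ranking sits from the aggregate ranking is a rank correlation over the models both rankings share. The sketch below uses Spearman's coefficient as an assumption; this summary does not state which correlation measure the authors use, and the function names are hypothetical.

```python
def ranks(scores):
    """Map each model to its rank (1 = best), averaging ranks across ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    result = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of models tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for model in ordered[i:j + 1]:
            result[model] = average_rank
        i = j + 1
    return result


def spearman(user_scores, aggregate_scores):
    """Spearman correlation between two score tables over their shared models."""
    shared = sorted(set(user_scores) & set(aggregate_scores))
    if len(shared) < 2:
        raise ValueError("need at least two shared models")
    ru = ranks({m: user_scores[m] for m in shared})
    ra = ranks({m: aggregate_scores[m] for m in shared})
    mean_u = sum(ru.values()) / len(shared)
    mean_a = sum(ra.values()) / len(shared)
    cov = sum((ru[m] - mean_u) * (ra[m] - mean_a) for m in shared)
    var_u = sum((ru[m] - mean_u) ** 2 for m in shared)
    var_a = sum((ra[m] - mean_a) ** 2 for m in shared)
    if var_u == 0 or var_a == 0:
        raise ValueError("correlation is undefined when all ranks tie")
    return cov / (var_u * var_a) ** 0.5
```

A value near 1 means the user orders models the same way the aggregate does, a value near 0 means the two orderings are unrelated, and a negative value means the user tends to prefer models the aggregate ranks lower.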
Abstract
With the rise in capabilities of large language models (LLMs) and their deployment in real-world tasks, evaluating LLM alignment with human preferences has become an important challenge. Current benchmarks average preferences across all users to compute aggregate ratings, overlooking individual user preferences when establishing model rankings. Since users have varying preferences in different contexts, we call for personalized LLM benchmarks that rank models according to individual needs. We compute personalized model rankings using Elo ratings and Bradley-Terry coefficients for 115 active Chatbot Arena users and analyze how user query characteristics (topics and writing style) relate to LLM ranking variations. We demonstrate that individual rankings of LLMs diverge dramatically from aggregate LLM rankings, with Bradley-Terry correlations averaging only (57% of…
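The Bradley-Terry coefficients mentioned in the abstract can be estimated from one user's pairwise outcomes. The following is a minimal sketch using the standard minorization-maximization (MM) update, not necessarily the authors' fitting procedure. The `(winner, loser)` input format and the `prior` pseudo-count, which keeps the estimates finite when a user has few votes, are assumptions made for illustration; ties are expected to be dropped or split by the caller.

```python
from collections import defaultdict


def bradley_terry(battles, iters=200, prior=0.1):
    """Fit Bradley-Terry strengths with the MM update.

    battles: iterable of (winner, loser) pairs for a single user.
    prior: assumed pseudo-count of virtual wins added for each side of
    every model pair, so a model with zero wins still gets a
    positive strength. Returns strengths normalized to sum to 1.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    models = set()
    for winner, loser in battles:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))
    models = sorted(models)
    n = len(models)
    strengths = {m: 1.0 / n for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = 0.0
            for j in models:
                if i == j:
                    continue
                # Observed comparisons plus one virtual win per side.
                n_ij = pair_counts[frozenset((i, j))] + 2.0 * prior
                denom += n_ij / (strengths[i] + strengths[j])
            updated[i] = (wins[i] + prior * (n - 1)) / denom
        total = sum(updated.values())
        strengths = {m: v / total for m, v in updated.items()}
    return strengths
```

Under this model the probability that model `i` beats model `j` is `strengths[i] / (strengths[i] + strengths[j])`, so sorting models by strength yields that user's personalized ranking.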
