Prompt-Dependent Ranking of Large Language Models with Uncertainty Quantification

Angel Rodrigo Avelar Menendez; Yufeng Liu; Xiaowu Dai

arXiv:2603.03336·cs.CL·March 5, 2026

Open Access

TL;DR

This paper introduces a statistically rigorous framework for ranking large language models from pairwise human preferences. The framework accounts for both estimation uncertainty and prompt dependence, so that deployment decisions do not rest on rank differences that are not statistically meaningful.

Contribution

It develops a novel method for prompt-dependent ranking inference with valid uncertainty quantification, moving beyond fixed point estimates of model utilities.

Findings

01. Rankings vary significantly across different prompts.

02. Many apparent rank differences are not statistically significant.

03. Uncertainty-aware rankings prevent misleading conclusions by acknowledging data limitations.

Abstract

Rankings derived from pairwise comparisons are central to many economic and computational systems. In the context of large language models (LLMs), rankings are typically constructed from human preference data and presented as leaderboards that guide deployment decisions. However, existing approaches rely on point estimates, implicitly treating rankings as fixed objects despite substantial estimation noise and context-dependent performance variation. Acting on such rankings can lead to misallocation and welfare loss when apparent differences are not statistically meaningful. We study prompt-dependent ranking inference under pairwise human preferences and develop a framework for decision-safe rankings with statistically valid uncertainty guarantees. We model preferences using a contextual Bradley-Terry-Luce model in which the latent utility of each model depends on the input prompt.…
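
To make the contextual Bradley-Terry-Luce idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each model's latent utility is linear in a prompt feature vector, which is an illustrative choice the abstract does not specify. The names `utility`, `win_probability`, `rank_models`, and all numeric values are hypothetical. The sketch covers only point estimates; it does not implement the uncertainty quantification the paper develops.

```python
import math

def utility(theta, x):
    # Assumed linear utility u(x) = theta . x; the linear form is an
    # illustrative assumption, not taken from the paper.
    return sum(t * xi for t, xi in zip(theta, x))

def win_probability(theta_i, theta_j, x):
    # Bradley-Terry-Luce: P(i preferred over j | prompt x) is the logistic
    # function of the utility difference at that prompt.
    diff = utility(theta_i, x) - utility(theta_j, x)
    return 1.0 / (1.0 + math.exp(-diff))

def rank_models(thetas, x):
    # Point-estimate ranking at prompt x, best first. No confidence sets here.
    return sorted(thetas, key=lambda name: utility(thetas[name], x), reverse=True)

# Hypothetical parameters and prompt features, chosen only for illustration.
thetas = {
    "model_a": [1.0, 0.0],
    "model_b": [0.0, 1.0],
    "model_c": [0.4, 0.4],
}
coding_prompt = [1.0, 0.2]
writing_prompt = [0.1, 1.0]

print(rank_models(thetas, coding_prompt))
print(rank_models(thetas, writing_prompt))
print(round(win_probability(thetas["model_a"], thetas["model_b"], coding_prompt), 3))
```

With these made-up values the ordering of the three models reverses between the two prompts, which is the kind of prompt dependence a single global leaderboard would hide.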

Taxonomy

Topics: Constraint Satisfaction and Optimization · Mobile Crowdsensing and Crowdsourcing · Big Data and Digital Economy