Confidence Diagram of Nonparametric Ranking for Uncertainty Assessment in Large Language Models Evaluation
Zebin Wang, Yi Han, Ethan X. Fang, Lan Wang, Junwei Lu

TL;DR
This paper introduces a nonparametric framework and confidence diagram methodology for statistically assessing and comparing large language models' domain-specific expertise, enhancing evaluation robustness.
Contribution
It proposes a novel confidence diagram approach based on Hasse diagrams and extends bootstrap theory for non-i.i.d. data, improving LLM evaluation methods.
Findings
Effective in evaluating LLMs across medical domains.
Provides a valid confidence set for ranking models.
Offers insights into model performance variability.
Abstract
We consider the inference for the ranking of large language models (LLMs). Alignment arises as a significant challenge to mitigate hallucinations in the use of LLMs. Ranking LLMs has proven to be an effective tool to improve alignment based on the best-of-N policy. In this paper, we propose a new inferential framework for hypothesis testing on the rankings of language models. Our approach builds on a nonparametric contextual ranking framework designed to assess large language models' domain-specific expertise, leveraging nonparametric scoring methods to account for their sensitivity to the prompts. To characterize the combinatorial complexity of the ranking, we introduce a novel concept of confidence diagram, which leverages a Hasse diagram to represent the entire confidence set of rankings by a single directed graph. We show the validity of the proposed confidence diagram by…
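To make the idea of summarizing a set of rankings with one directed graph concrete, here is a minimal sketch of how a Hasse diagram can be derived from pairwise orderings. It is not the authors' procedure: it assumes the statistically supported comparisons are already given as a set of "a ranks above b" pairs forming a strict partial order (no cycles), and the model names and the functions `transitive_closure`, `hasse_edges`, and `consistent_rankings` are hypothetical names chosen for illustration. How those pairs would be obtained, for example through the bootstrap step the paper describes, is outside the sketch.

```python
from itertools import permutations


def transitive_closure(nodes, pairs):
    """Return every (a, b) implied by the given 'a ranks above b' pairs."""
    closure = set(pairs)
    # Warshall-style closure: k is the intermediate node.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if (i, k) in closure and (k, j) in closure:
                    closure.add((i, j))
    return closure


def hasse_edges(nodes, pairs):
    """Keep only covering relations: drop (a, b) if some c sits between them.

    Assumes the input pairs are acyclic (a strict partial order).
    """
    closure = transitive_closure(nodes, pairs)
    return {
        (a, b)
        for (a, b) in closure
        if not any((a, c) in closure and (c, b) in closure for c in nodes)
    }


def consistent_rankings(nodes, pairs):
    """Enumerate all total rankings that respect every pairwise ordering.

    Brute force over permutations, so only suitable for a handful of models.
    """
    closure = transitive_closure(nodes, pairs)
    return [
        perm
        for perm in permutations(nodes)
        if all(perm.index(a) < perm.index(b) for a, b in closure)
    ]


# Hypothetical example: four models and four assumed pairwise orderings.
models = ["A", "B", "C", "D"]
pairs = {("A", "B"), ("B", "C"), ("A", "C"), ("A", "D")}

edges = hasse_edges(models, pairs)
rankings = consistent_rankings(models, pairs)
print(sorted(edges))
print(len(rankings))
```

In this toy example the pair (A, C) is dropped from the diagram because it follows from (A, B) and (B, C), and the remaining edges compactly encode every total ranking compatible with the orderings. For larger graphs, a graph library's transitive reduction routine would be a more practical choice than the cubic loops used here.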
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Bayesian Methods and Mixture Models
Methods: Sparse Evolutionary Training
