Confidence Diagram of Nonparametric Ranking for Uncertainty Assessment in Large Language Models Evaluation

Zebin Wang; Yi Han; Ethan X. Fang; Lan Wang; Junwei Lu

arXiv:2412.05506 · stat.ML · February 11, 2025

TL;DR

This paper introduces a nonparametric framework and confidence diagram methodology for statistically assessing and comparing large language models' domain-specific expertise, enhancing evaluation robustness.

Contribution

It proposes a novel confidence diagram approach based on Hasse diagrams and extends bootstrap theory for non-i.i.d. data, improving LLM evaluation methods.

Findings

1. Effective in evaluating LLMs across medical domains.

2. Provides a valid confidence set for ranking models.

3. Offers insights into model performance variability.

Abstract

We consider inference for the ranking of large language models (LLMs). Alignment is a significant challenge in mitigating hallucinations when LLMs are used. Ranking LLMs has proven to be an effective tool for improving alignment under the best-of-$N$ policy. In this paper, we propose a new inferential framework for hypothesis testing on the rankings of language models. It builds on a nonparametric contextual ranking framework designed to assess large language models' domain-specific expertise, leveraging nonparametric scoring methods to account for their sensitivity to the prompts. To characterize the combinatorial complexity of the ranking, we introduce a novel concept, the confidence diagram, which leverages a Hasse diagram to represent the entire confidence set of rankings by a single directed graph. We show the validity of the proposed confidence diagram by…
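To make the Hasse-diagram idea concrete, here is a minimal sketch in Python. It is not the authors' procedure: it assumes we are already given a set of pairwise orderings "model u ranks above model v" (the hypothetical `edges` input, with made-up model names A to D) and it only shows how such a partial order reduces to a Hasse diagram and which full rankings remain compatible with it. Treating the compatible rankings as the linear extensions of the partial order is an assumption made for illustration.

```python
from itertools import permutations, product


def transitive_closure(nodes, edges):
    """Return {u: set of all v reachable from u} (Warshall-style closure)."""
    reach = {u: set() for u in nodes}
    for u, v in edges:
        reach[u].add(v)
    # product() varies its first argument slowest, so k is the outer loop.
    for k, i in product(nodes, nodes):
        if k in reach[i]:
            reach[i] |= reach[k]
    return reach


def hasse_edges(nodes, edges):
    """Keep only covering relations: u > v with no w such that u > w > v."""
    reach = transitive_closure(nodes, edges)
    if any(u in reach[u] for u in nodes):
        raise ValueError("orderings contain a cycle, so they are not a partial order")
    cover = set()
    for u in nodes:
        for v in reach[u]:
            if not any(v in reach[w] for w in reach[u] if w != v):
                cover.add((u, v))
    return sorted(cover)


def linear_extensions(nodes, edges):
    """Brute-force list of full rankings consistent with every ordering."""
    out = []
    for perm in permutations(nodes):
        pos = {m: i for i, m in enumerate(perm)}
        if all(pos[u] < pos[v] for u, v in edges):
            out.append(perm)
    return out


# Hypothetical example: four models and four assumed pairwise orderings.
models = ["A", "B", "C", "D"]
orderings = [("A", "B"), ("B", "C"), ("A", "C"), ("A", "D")]

diagram = hasse_edges(models, orderings)
rankings = linear_extensions(models, orderings)
print(diagram)
print(rankings)
```

In this toy input the ordering A above C is implied by A above B and B above C, so the diagram drops it. The brute-force enumeration is only practical for a handful of models, which is one way to see why a single graph is a more compact summary than listing every ranking.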

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Bayesian Methods and Mixture Models

Methods: Sparse Evolutionary Training