Ranking Large Language Models without Ground Truth

Amit Dhurandhar, Rahul Nair, Moninder Singh, Elizabeth Daly, and Karthikeyan Natesan Ramamurthy

arXiv:2402.14860·cs.CL·June 11, 2024·1 cites

Open Access

TL;DR

This paper introduces a novel method for ranking large language models without relying on ground truth data, using triplet evaluations where models assess each other to identify the weakest, enabling reliable ranking across various tasks.

Contribution

The paper proposes an approach for ranking LLMs without ground truth, based on triplet evaluations, and analyzes sufficient conditions under which the approach succeeds.

Findings

01. Methods reliably recover close-to-true rankings without reference data

02. Effective across summarization, multiple-choice, and dialog tasks

03. Provides a low-resource alternative to traditional evaluation

Abstract

Evaluation and ranking of large language models (LLMs) has become an important problem with the proliferation of these models and their impact. Evaluation methods either require human responses, which are expensive to acquire, or use pairs of LLMs to evaluate each other, which can be unreliable. In this paper, we provide a novel perspective where, given a dataset of prompts (viz. questions, instructions, etc.) and a set of LLMs, we rank them without access to any ground truth or reference responses. Inspired by real life, where both an expert and a knowledgeable person can identify a novice, our main idea is to consider triplets of models, where each one of them evaluates the other two, correctly identifying the worst model in the triplet with high probability. We also analyze our idea and provide sufficient conditions for it to succeed. Applying this idea repeatedly, we propose two methods…
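The triplet idea in the abstract can be illustrated with a small sketch. This is one plausible reading under stated assumptions, not the paper's two methods (the abstract is truncated before it describes them). The `judge` callback, the toy `strength` table, and the helper names `worst_in_triplet` and `count_worst_flags` are all hypothetical: `judge(evaluator, x, y)` is assumed to return whichever of `x` and `y` the evaluator prefers, and a model is flagged as worst when both of the other two models rank it below their peer.

```python
from collections import Counter
from itertools import combinations


def worst_in_triplet(triplet, judge):
    """Return the model that both peers judge worse, or None if votes cycle."""
    a, b, c = triplet
    losses = Counter()
    # Each model acts as evaluator for the other two models in the triplet.
    for evaluator, x, y in ((a, b, c), (b, a, c), (c, a, b)):
        winner = judge(evaluator, x, y)
        losses[y if winner == x else x] += 1
    # Three votes are cast in total, so at most one model can lose twice.
    for model in triplet:
        if losses[model] == 2:
            return model
    return None


def count_worst_flags(models, judge):
    """Count, over all triplets, how often each model is flagged as worst."""
    flags = Counter({m: 0 for m in models})
    for triplet in combinations(models, 3):
        worst = worst_in_triplet(triplet, judge)
        if worst is not None:
            flags[worst] += 1
    return dict(flags)


# Toy setup (assumption): a hidden strength score stands in for model quality,
# and every evaluator judges perfectly. Real evaluators would be noisy LLMs.
strength = {"m1": 4, "m2": 3, "m3": 2, "m4": 1}


def toy_judge(evaluator, x, y):
    return x if strength[x] > strength[y] else y


flags = count_worst_flags(list(strength), toy_judge)
# Fewer "worst" flags means a higher position in this illustrative ranking.
ranking = sorted(strength, key=lambda m: flags[m])
```

Note that counting worst flags alone cannot separate the two strongest models: under a perfect judge neither is ever the worst of any triplet, so both receive zero flags and their relative order here comes only from the input order.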

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Sparse Evolutionary Training