Poor-Supervised Evaluation for SuperLLM via Mutual Consistency

Peiwen Yuan; Shaoxiong Feng; Yiwei Li; Xinglin Wang; Boyuan Pan; Heda Wang; Yao Hu; Kan Li

arXiv:2408.13738·cs.CL·August 27, 2024

Open Access

TL;DR

This paper introduces PoEM, a framework that evaluates large language models without relying on accurate labels by measuring their mutual consistency with reference models. Its results correlate highly with traditional supervised evaluations.

Contribution

The paper proposes a novel poor-supervised evaluation method for LLMs based on mutual consistency, extending evaluation paradigms to include models as references alongside humans.

Findings

01

PoEM achieves a Pearson correlation of 0.98 with supervised evaluation results.

02

It demonstrates effectiveness, efficiency, and generalizability across multiple tasks and models.

03

It shifts evaluation from a human-centric paradigm to a human&model-centric one.

Abstract

The guidance from capability evaluations has greatly propelled the progress of both human society and Artificial Intelligence. However, as LLMs evolve, it becomes challenging to construct evaluation benchmarks for them with accurate labels on hard tasks that approach the boundaries of human capabilities. To credibly conduct evaluation without accurate labels (denoted as poor-supervised evaluation), we propose the PoEM framework. We first prove that the capability of a model can be equivalently assessed by the consistency between it and a certain reference model, when their prediction distributions are independent and the sample size is infinite. To alleviate the insufficiencies of these conditions in reality, we further introduce an algorithm that treats humans (when available) and the models under evaluation as reference models, alternately conducting model weights calibration and…
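The consistency claim in the abstract can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not taken from the paper: a multiple-choice task with `num_classes` options, models whose errors are independent of each other and spread uniformly over the wrong options, and a reference model that is better than chance. Under those assumptions the expected agreement between a model with accuracy a and a reference with accuracy r is a·r + (1 − a)(1 − r)/(K − 1), which increases with a, so ranking models by agreement with the reference matches ranking them by accuracy. All function names and parameter values below are hypothetical, and the sketch does not reproduce the PoEM algorithm or its calibration step.

```python
import random


def simulate_predictions(labels, accuracy, num_classes, rng):
    """Simulate a model that is correct with probability `accuracy`.

    Assumption (not from the paper): errors are independent per sample
    and uniform over the wrong classes.
    """
    preds = []
    for y in labels:
        if rng.random() < accuracy:
            preds.append(y)
        else:
            # Draw uniformly from the num_classes - 1 classes other than y.
            wrong = rng.randrange(num_classes - 1)
            preds.append(wrong if wrong < y else wrong + 1)
    return preds


def agreement(preds_a, preds_b):
    """Fraction of samples on which two prediction lists coincide."""
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def expected_agreement(acc, ref_acc, num_classes):
    """Closed-form agreement under the independence assumption above."""
    return acc * ref_acc + (1 - acc) * (1 - ref_acc) / (num_classes - 1)


def run_demo(seed=0, n=20000, num_classes=4, ref_acc=0.7,
             accs=(0.3, 0.5, 0.7, 0.9)):
    """Return (accuracy, observed agreement, expected agreement) per model."""
    rng = random.Random(seed)
    labels = [rng.randrange(num_classes) for _ in range(n)]
    # The reference model's predictions; labels are used only to simulate,
    # never to compute the agreement score itself.
    reference = simulate_predictions(labels, ref_acc, num_classes, rng)
    results = []
    for acc in accs:
        preds = simulate_predictions(labels, acc, num_classes, rng)
        results.append((acc,
                        agreement(preds, reference),
                        expected_agreement(acc, ref_acc, num_classes)))
    return results


if __name__ == "__main__":
    for acc, observed, expected in run_demo():
        print(f"accuracy={acc:.2f} "
              f"observed_agreement={observed:.3f} expected={expected:.3f}")
```

With a finite sample the observed agreement only approximates the closed form, and correlated errors between models break the relationship. Those are the two conditions the abstract says the calibration algorithm is meant to relax.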

Taxonomy

Topics: Machine Learning and ELM · Face and Expression Recognition · Neural Networks and Applications