Poor-Supervised Evaluation for SuperLLM via Mutual Consistency
Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Boyuan Pan, Heda Wang, Yao Hu, Kan Li

TL;DR
This paper introduces PoEM, a framework that evaluates large language models without accurate labels by measuring their mutual consistency with reference models. Its results correlate highly with traditional supervised evaluation.
Contribution
The paper proposes a novel poor-supervised evaluation method for LLMs based on mutual consistency, extending evaluation paradigms to include models as references alongside humans.
Findings
PoEM achieves a Pearson correlation of 0.98 with supervised evaluation.
It demonstrates effectiveness, efficiency, and generalizability across multiple tasks and models.
It shifts evaluation from a human-centric paradigm to a human&model-centric one.
Abstract
The guidance from capability evaluations has greatly propelled the progress of both human society and Artificial Intelligence. However, as LLMs evolve, it becomes challenging to construct evaluation benchmarks for them with accurate labels on hard tasks that approach the boundaries of human capabilities. To credibly conduct evaluation without accurate labels (denoted as poor-supervised evaluation), we propose the PoEM framework. We first prove that the capability of a model can be equivalently assessed by the consistency between it and a certain reference model, when their prediction distributions are independent and the sample size is infinite. To alleviate the insufficiencies of these conditions in reality, we further introduce an algorithm that treats humans (when available) and the models under evaluation as reference models, alternately conducting model weights calibration and…
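The core claim in the abstract, that consistency with a reference model can stand in for accuracy when predictions are independent, can be illustrated with a small simulation. This is a minimal sketch and not the PoEM algorithm itself. It assumes a four-class task, a reference model with accuracy 0.8, and errors spread uniformly over the wrong classes. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import random


def expected_agreement(acc_model, acc_ref, num_classes):
    """Expected agreement rate of two models with independent predictions,
    assuming each model's errors are uniform over the wrong classes."""
    both_right = acc_model * acc_ref
    same_wrong = (1 - acc_model) * (1 - acc_ref) / (num_classes - 1)
    return both_right + same_wrong


def simulate_predictions(labels, accuracy, num_classes, rng):
    """Draw predictions that match the label with probability `accuracy`,
    otherwise pick one of the wrong classes uniformly at random."""
    preds = []
    for y in labels:
        if rng.random() < accuracy:
            preds.append(y)
        else:
            wrong = rng.randrange(num_classes - 1)
            preds.append(wrong if wrong < y else wrong + 1)  # skip the true class
    return preds


def agreement(preds_a, preds_b):
    """Fraction of samples on which two prediction lists coincide."""
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
num_classes = 4  # assumed task size
labels = [rng.randrange(num_classes) for _ in range(20000)]

# Hypothetical reference model; the labels are used only to generate predictions.
reference = simulate_predictions(labels, 0.8, num_classes, rng)

# Hypothetical models under evaluation, with known (simulated) accuracies.
accuracies = [0.3, 0.45, 0.6, 0.75, 0.9]
consistencies = [
    agreement(simulate_predictions(labels, acc, num_classes, rng), reference)
    for acc in accuracies
]

# Label-free consistency scores should track the true accuracies closely.
corr = pearson(accuracies, consistencies)
print(consistencies, corr)
```

Under these assumptions the expected agreement is linear in the evaluated model's accuracy, so ranking models by consistency with the reference reproduces their accuracy ranking as long as the reference is better than chance. Real models rarely make independent errors, which is the gap the abstract says the calibration algorithm is meant to address.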
Taxonomy
Topics: Machine Learning and ELM · Face and Expression Recognition · Neural Networks and Applications
