On Evaluating LLM Alignment by Evaluating LLMs as Judges
Yixin Liu, Pengfei Liu, Arman Cohan

TL;DR
This paper examines whether an LLM's alignment with human preferences can be measured through its ability to judge responses. It proposes a benchmark that scores alignment by evaluation performance rather than by directly assessing the model's outputs.
Contribution
It introduces AlignEval, a benchmark that measures an LLM's alignment by evaluating its ability as a judge, and it shows a strong correlation between LLMs' generation and evaluation capabilities.
Findings
AlignEval matches or exceeds existing benchmarks in capturing human preferences.
LLMs' generation and evaluation capabilities are strongly correlated when both are scored by a strong LLM preference oracle.
Alignment can therefore be assessed through a model's judging performance, without directly evaluating its outputs.
Abstract
Alignment with human preferences is an important evaluation aspect of large language models (LLMs), requiring them to be helpful, honest, safe, and to precisely follow human instructions. Evaluating LLMs' alignment typically involves directly assessing their open-ended responses, requiring human annotators or strong LLM judges. Conversely, LLMs themselves have also been extensively evaluated as judges for assessing alignment. In this work, we examine the relationship between LLMs' generation and evaluation capabilities in aligning with human preferences. To this end, we first conduct a comprehensive analysis of the generation-evaluation consistency (GE-consistency) among various LLMs, revealing a strong correlation between their generation and evaluation capabilities when evaluated by a strong LLM preference oracle. Utilizing this finding, we propose a benchmarking paradigm that…
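To make the idea of GE-consistency concrete, the sketch below computes a rank correlation between per-model generation scores and per-model evaluation (judging) scores. It is a minimal illustration, not the authors' implementation: the use of Spearman's rank correlation is an assumption (the summary above does not state which correlation measure is used), and the model names and scores are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores (illustrative only, not taken from the paper).
# generation_scores: how well each model's responses are preferred.
# evaluation_scores: how often each model, acting as a judge, agrees
# with the preference oracle.
generation_scores = {"model_a": 0.71, "model_b": 0.55, "model_c": 0.62, "model_d": 0.30}
evaluation_scores = {"model_a": 0.83, "model_b": 0.70, "model_c": 0.66, "model_d": 0.52}

models = sorted(generation_scores)
ge_consistency = spearman(
    [generation_scores[m] for m in models],
    [evaluation_scores[m] for m in models],
)
print(f"GE-consistency (Spearman): {ge_consistency:.3f}")
```

A value near 1 would mean that ranking models by judging ability gives nearly the same ordering as ranking them by generation quality, which is the property a judge-based benchmark relies on.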
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
