Large Language Models are not Fair Evaluators

Peiyi Wang; Lei Li; Liang Chen; Zefan Cai; Dawei Zhu; Binghuai Lin; Yunbo Cao; Qi Liu; Tianyu Liu; Zhifang Sui

arXiv:2305.17926 · cs.CL · August 31, 2023 · 30 citations

TL;DR

This paper shows that large language models used as evaluators of AI responses are biased by the order in which candidate responses appear, so that simply reordering the responses can skew the result. It proposes calibration strategies that improve fairness and alignment with human judgments.

Contribution

The paper identifies systematic biases in LLM-based evaluation methods and introduces a calibration framework to mitigate these biases, enhancing evaluation fairness and reliability.

Findings

01

Reordering the candidate responses in the prompt can change which model the evaluator ranks higher.

02

Calibration strategies reduce bias and improve alignment with human judgments.

03

Proposed methods are effective across multiple evaluation scenarios.

Abstract

In this paper, we uncover a systematic bias in the evaluation paradigm of adopting large language models (LLMs), e.g., GPT-4, as a referee to score and compare the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna-13B could beat ChatGPT on 66 over 80 tested queries with ChatGPT as an evaluator. To address this issue, we propose a calibration framework with three simple yet effective strategies: 1) Multiple Evidence Calibration, which requires the evaluator model to generate multiple evaluation evidence before assigning ratings; 2) Balanced Position Calibration, which aggregates results across various orders to determine…
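The position effect and the idea of aggregating over orders can be illustrated with a small Python sketch. This is not the authors' implementation: toy_biased_evaluator is a hypothetical stand-in for an LLM judge (it scores by word count and adds an assumed bonus to whichever response appears first), and averaging the two orders is only one plausible reading of "aggregates results across various orders".

```python
from statistics import mean
from typing import Callable, Tuple

# An evaluator receives (first_response, second_response) and returns
# one score per slot, in the same order.
Evaluator = Callable[[str, str], Tuple[float, float]]


def toy_biased_evaluator(first: str, second: str) -> Tuple[float, float]:
    """Hypothetical judge: word count as 'quality', plus a first-slot bonus."""
    position_bonus = 2.0  # assumed bias, chosen only for illustration
    return len(first.split()) + position_bonus, float(len(second.split()))


def winner(score_a: float, score_b: float) -> str:
    """Return which response has the higher score."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


def balanced_scores(evaluate: Evaluator, a: str, b: str) -> Tuple[float, float]:
    """Score A and B in both orders and average each response's scores."""
    a_first, b_second = evaluate(a, b)   # A shown first
    b_first, a_second = evaluate(b, a)   # B shown first
    return mean([a_first, a_second]), mean([b_second, b_first])


if __name__ == "__main__":
    resp_a = "Paris is the capital."
    resp_b = "Paris is the capital city."

    # Single-pass verdicts depend on which response is shown first.
    print("A first:", winner(*toy_biased_evaluator(resp_a, resp_b)))
    s_b, s_a = toy_biased_evaluator(resp_b, resp_a)
    print("B first:", winner(s_a, s_b))

    # Averaging over both orders cancels the first-slot bonus.
    print("Balanced:", winner(*balanced_scores(toy_biased_evaluator, resp_a, resp_b)))
```

With a real judge, evaluate would wrap a model call and parse its two scores; the swap-and-average logic would stay the same.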

Code & Models

Repositories

i-eval/faireval (Official)

Videos

Large Language Models are not Fair Evaluators · Underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)

Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Byte Pair Encoding · Softmax · Label Smoothing · Dropout · Residual Connection · Linear Layer · Absolute Position Encodings