An Empirical Analysis on Large Language Models in Debate Evaluation

Xinyi Liu; Pinxin Liu; Hangfeng He

arXiv:2406.00050·cs.CL·June 5, 2024

TL;DR

This paper evaluates large language models such as GPT-3.5 and GPT-4 as debate evaluators. It finds that they outperform humans and existing methods, but that positional, lexical, and end-of-discussion biases affect their judgments.

Contribution

It provides the first comprehensive analysis of LLMs' debate evaluation capabilities and uncovers inherent biases influencing their performance.

Findings

01. LLMs outperform humans and state-of-the-art methods in debate evaluation.

02. LLM judgments show positional, lexical, and end-of-discussion biases.

03. These biases are influenced by prompt design and label verbalizer choices.

Abstract

In this study, we investigate the capabilities and inherent biases of advanced large language models (LLMs) such as GPT-3.5 and GPT-4 in the context of debate evaluation. We find that LLMs' performance exceeds that of humans and surpasses that of state-of-the-art methods fine-tuned on extensive datasets. We additionally explore and analyze biases present in LLMs, including positional bias, lexical bias, and order bias, which may affect their evaluative judgments. Our findings reveal a consistent bias in both GPT-3.5 and GPT-4 towards the second candidate response presented, attributed to prompt design. We also uncover lexical biases in both GPT-3.5 and GPT-4, especially when label sets carry connotations such as numerical or sequential order, highlighting the critical need for careful label verbalizer selection in prompt design. Additionally, our analysis indicates a…
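
The bias toward the second candidate response mentioned in the abstract can be probed with a simple order-swap check: judge every pair twice, once in each order, and count how often the verdict goes to whichever side was shown second. The sketch below is illustrative only and is not taken from the paper or its repository; the `Judge` signature, the function `second_position_rate`, and both toy judges are hypothetical stand-ins for a real LLM call.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

# Hypothetical judge interface (an assumption, not the paper's API):
# takes (first_text, second_text) and returns "first" or "second"
# for whichever presented side it prefers.
Judge = Callable[[str, str], str]


def second_position_rate(judge: Judge, pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of verdicts favouring whichever side was shown second.

    Each pair is judged in both orders, so a judge that decides purely
    on content scores 0.5; values well above 0.5 suggest a preference
    for the second position.
    """
    counts = Counter()
    for a, b in pairs:
        counts[judge(a, b)] += 1  # original order
        counts[judge(b, a)] += 1  # swapped order
    total = counts["first"] + counts["second"]
    if total == 0:
        raise ValueError("no verdicts collected")
    return counts["second"] / total


# Toy judges standing in for an LLM (assumptions for illustration).
def always_second(first: str, second: str) -> str:
    # Maximally position-biased: ignores content entirely.
    return "second"


def longer_wins(first: str, second: str) -> str:
    # Content-based: prefers the longer text regardless of position.
    return "first" if len(first) >= len(second) else "second"


toy_pairs = [("short case", "a much longer case"), ("pro argument text", "con")]
biased_rate = second_position_rate(always_second, toy_pairs)
fair_rate = second_position_rate(longer_wins, toy_pairs)
```

With a real model, `judge` would wrap the prompt and parse the returned label; because the abstract attributes the bias to prompt design and label choice, the same check would need to be rerun for each prompt template and label set under study.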

Code & Models

Repositories

xinyiliu0227/llm_debate_bias (Official)

Videos

An Empirical Analysis on Large Language Models in Debate Evaluation (Underline)

Taxonomy

Topics: Computational and Text Analysis Methods

Methods: Attention Is All You Need · Cosine Annealing · Softmax · Layer Normalization · Weight Decay · Attention Dropout · Linear Layer