Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks

Qintong Li; Leyang Cui; Lingpeng Kong; Wei Bi

arXiv:2310.19740 · cs.CL · January 22, 2025 · 1 citation

TL;DR

This paper investigates the reliability of large language models as evaluators for diverse NLP tasks, highlighting their strengths and limitations compared to human evaluators, and proposing pre-drafting to improve evaluation objectivity.

Contribution

It provides a detailed analysis of LLM evaluators' alignment with human judgments across various NLP tasks and introduces pre-drafting as a method to enhance evaluation consistency.

Findings

1. LLM evaluators sometimes omit crucial criteria or add unnecessary ones.
2. LLMs perform well on general criteria such as fluency.
3. Challenges remain for complex criteria such as numerical reasoning.

Abstract

Previous work adopts large language models (LLMs) as evaluators for natural language processing (NLP) tasks. However, current LLM evaluators still have shortcomings in, e.g., fairness, scope, and accuracy. To analyze whether LLMs can serve as reliable alternatives to humans, we examine the fine-grained alignment between LLM evaluators and human annotators, particularly in understanding the target evaluation tasks and conducting evaluations that meet diverse criteria. This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning), each with different evaluation criteria. Our analysis shows that 1) LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts; 2) LLM evaluators excel in general criteria, such as fluency, but face challenges with complex criteria, such…
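
One common way to quantify per-criterion alignment between an LLM evaluator and human annotators is a rank correlation over their scores. The following is a minimal sketch under that assumption; the abstract does not say which agreement measure the paper uses, and the function names, criterion labels, and scores below are all hypothetical and invented for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def per_criterion_alignment(human, llm):
    """Map each criterion scored by both sides to its Spearman correlation."""
    return {c: spearman(human[c], llm[c]) for c in human if c in llm}


# Toy scores for five hypothetical outputs (not data from the paper).
human_scores = {"criterion_a": [1, 2, 3, 4, 5], "criterion_b": [1, 2, 3, 4, 5]}
llm_scores = {"criterion_a": [1, 2, 3, 4, 5], "criterion_b": [5, 4, 3, 2, 1]}

alignment = per_criterion_alignment(human_scores, llm_scores)
for criterion, rho in alignment.items():
    print(f"{criterion}: rho = {rho:.2f}")
```

Reporting one coefficient per criterion, rather than a single pooled number, keeps a gap between general and complex criteria visible instead of averaging it away.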

Code & Models

Repositories

qtli/coeval (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research