Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks
Qintong Li, Leyang Cui, Lingpeng Kong, Wei Bi

TL;DR
This paper investigates the reliability of large language models as evaluators for diverse NLP tasks, highlighting their strengths and limitations compared to human evaluators, and proposing pre-drafting to improve evaluation objectivity.
Contribution
It provides a detailed analysis of LLM evaluators' alignment with human judgments across various NLP tasks and introduces pre-drafting as a method to enhance evaluation consistency.
Findings
LLM evaluators sometimes omit or add unnecessary criteria.
LLMs perform well on general criteria like fluency.
Challenges remain for complex criteria such as numerical reasoning.
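The findings above can be illustrated with a minimal sketch, which is not the authors' method. It assumes criteria can be matched by case-insensitive name and that humans and the LLM score the same items on the same numeric scale. The function names `compare_criteria` and `mean_abs_gap` and all sample data are hypothetical.

```python
from statistics import mean


def compare_criteria(expert, llm):
    """Split criteria into omitted, unnecessary, and shared sets.

    Assumption: two criteria match when their names are equal after
    lowercasing and trimming whitespace. Real criteria usually need
    human or semantic matching.
    """
    expert_set = {c.strip().lower() for c in expert}
    llm_set = {c.strip().lower() for c in llm}
    return {
        "omitted": sorted(expert_set - llm_set),      # expert criteria the LLM missed
        "unnecessary": sorted(llm_set - expert_set),  # LLM criteria experts did not list
        "shared": sorted(expert_set & llm_set),
    }


def mean_abs_gap(human_scores, llm_scores):
    """Mean absolute score difference per criterion scored by both sides.

    Each argument maps a criterion name to a list of scores, one per item,
    in the same item order (an assumption of this sketch).
    """
    gaps = {}
    for criterion in human_scores.keys() & llm_scores.keys():
        h, m = human_scores[criterion], llm_scores[criterion]
        if len(h) != len(m):
            raise ValueError(f"score lists differ in length for {criterion!r}")
        gaps[criterion] = mean(abs(a - b) for a, b in zip(h, m))
    return gaps


# Hypothetical data, for illustration only.
report = compare_criteria(
    expert=["Fluency", "Numerical reasoning"],
    llm=["fluency", "Creativity"],
)
gaps = mean_abs_gap(
    {"fluency": [5, 4, 3], "numerical reasoning": [2, 5, 1]},
    {"fluency": [5, 4, 4], "numerical reasoning": [4, 3, 3]},
)
```

A larger gap on a criterion would suggest weaker agreement with human annotators on that criterion; a rank correlation or a chance-corrected agreement statistic could be substituted for the mean absolute gap.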
Abstract
Previous work adopts large language models (LLMs) as evaluators for natural language processing (NLP) tasks. However, certain shortcomings, e.g., fairness, scope, and accuracy, persist for current LLM evaluators. To analyze whether LLMs can serve as reliable alternatives to humans, we examine the fine-grained alignment between LLM evaluators and human annotators, particularly in understanding the target evaluation tasks and conducting evaluations that meet diverse criteria. This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning), each with different evaluation criteria. Our analysis shows that 1) LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts, and 2) LLM evaluators excel in general criteria, such as fluency, but face challenges with complex criteria, such…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
