LLM-as-a-qualitative-judge: automating error analysis in natural language generation
Nadezhda Chirkova, Tunde Oluwaseyi Ajayi, Seth Aycock, Zain Muhammad Mujahid, Vladana Perlić, Ekaterina Borisova, Markarit Vartampetian

TL;DR
This paper introduces an LLM-based evaluation method that generates structured qualitative reports on the errors of an NLG system, helping developers improve it; the method agrees with human annotations in about 2/3 of cases.
Contribution
The work presents a new LLM-based approach to qualitative error analysis in NLG, consisting of open-ended per-instance issue analysis followed by a cumulative clustering algorithm, together with an evaluation strategy validated on ~300 annotated instances from 12 NLG datasets.
Findings
LLM-as-a-qualitative-judge matches human annotations in 2/3 of cases
It produces error reports similar to human annotators
Use of the method improves NLG system performance
Abstract
Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e. with numerical scores as main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach with the main output being a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights on what improvements can be done to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that…
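The abstract names the two steps but does not spell out the clustering procedure, so the following is only a minimal sketch of one plausible reading of a "cumulative" algorithm: issues are processed one at a time, and each either joins an existing cluster or starts a new one. In the paper this decision is made by an LLM; here it is replaced by a simple token-overlap heuristic so the example runs on its own. The names `cumulative_clustering` and `jaccard_match`, the `threshold` parameter, and the example issue strings are all hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List, Optional


def tokens(text: str) -> set:
    """Lowercased whitespace tokens of a short issue description."""
    return set(text.lower().split())


def jaccard_match(issue: str, cluster_names: List[str],
                  threshold: float = 0.3) -> Optional[int]:
    """Stand-in for the LLM decision (an assumption, not the paper's method).

    Returns the index of the best-matching cluster, or None when no
    cluster reaches the overlap threshold.
    """
    best_idx, best_score = None, threshold
    issue_tokens = tokens(issue)
    for idx, name in enumerate(cluster_names):
        union = issue_tokens | tokens(name)
        score = len(issue_tokens & tokens(name)) / len(union) if union else 0.0
        if score >= best_score:
            best_idx, best_score = idx, score
    return best_idx


def cumulative_clustering(
    issues: List[str],
    match: Callable[[str, List[str]], Optional[int]] = jaccard_match,
) -> List[Dict]:
    """Process issues one by one, growing the set of clusters as needed."""
    clusters: List[Dict] = []
    for issue in issues:
        idx = match(issue, [c["name"] for c in clusters])
        if idx is None:
            # No suitable cluster yet: the issue founds a new one.
            clusters.append({"name": issue, "members": [issue]})
        else:
            clusters[idx]["members"].append(issue)
    # Report the most frequent issue types first.
    return sorted(clusters, key=lambda c: len(c["members"]), reverse=True)


# Hypothetical per-instance issue descriptions (step 1 output).
issues = [
    "answer contains hallucinated facts",
    "output is truncated mid sentence",
    "answer contains hallucinated dates",
    "output is truncated early",
]

for cluster in cumulative_clustering(issues):
    print(len(cluster["members"]), cluster["name"])
```

Passing the matching step in as a function keeps the loop independent of how the decision is made, so an LLM call could be swapped in for `jaccard_match` without changing the clustering logic.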
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques
