LLM-as-a-qualitative-judge: automating error analysis in natural language generation

Nadezhda Chirkova; Tunde Oluwaseyi Ajayi; Seth Aycock; Zain Muhammad Mujahid; Vladana Perlić; Ekaterina Borisova; Markarit Vartampetian

arXiv:2506.09147·cs.CL·December 22, 2025

Open Access · 1 Video

TL;DR

This paper introduces an LLM-based evaluation method that generates a structured qualitative report of common error types in NLG system outputs, helping developers decide what to improve. The issues it identifies match human annotations in about 2/3 of cases.

Contribution

The work presents a new approach to qualitative error analysis in NLG using LLMs: open-ended per-instance issue analysis followed by a cumulative clustering algorithm, together with an evaluation strategy validated on ~300 issue annotations from 12 NLG datasets.

Findings

01

LLM-as-a-qualitative-judge matches human annotations in 2/3 of cases

02

It produces error reports similar to human annotators

03

Use of the method improves NLG system performance

Abstract

Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e. with numerical scores as main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach with the main output being a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights on what improvements can be done to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that…
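The abstract names two steps, per-instance issue analysis and cumulative clustering, but does not spell out the algorithm here. The following is only a minimal sketch of what a cumulative clustering pass could look like, assuming issues are processed one at a time and each is either attached to an existing cluster or used to open a new one. The names `cumulative_cluster`, `build_report`, and `keyword_assigner` are hypothetical, and the keyword rule is a toy stand-in for the LLM decision, not the authors' method.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical stand-in for the LLM call: given one issue description and the
# names of the clusters found so far, return an existing name or a new one.
Assigner = Callable[[str, List[str]], str]


def cumulative_cluster(issues: List[str], assign: Assigner) -> Dict[str, List[str]]:
    """Process issues one by one, growing the set of clusters as we go."""
    clusters: Dict[str, List[str]] = {}
    for issue in issues:
        name = assign(issue, list(clusters))
        # An unseen name opens a new cluster; a known name extends that cluster.
        clusters.setdefault(name, []).append(issue)
    return clusters


def build_report(clusters: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Summarize clusters as (issue type, count), most frequent first."""
    return sorted(((name, len(members)) for name, members in clusters.items()),
                  key=lambda pair: -pair[1])


def _keywords(text: str) -> Set[str]:
    # Toy assumption: words longer than four characters carry the meaning.
    return {word for word in text.lower().split() if len(word) > 4}


def keyword_assigner(issue: str, existing: List[str]) -> str:
    """Toy replacement for the LLM: reuse a cluster that shares a keyword."""
    for name in existing:
        if _keywords(issue) & _keywords(name):
            return name
    return issue  # no match: the issue text itself names a new cluster


# Made-up per-instance issue descriptions, for illustration only.
issues = [
    "answer written in the wrong language",
    "summary contains a hallucinated date",
    "response is in the wrong language",
    "hallucinated citation in the output",
    "output in the wrong language",
]
clusters = cumulative_cluster(issues, keyword_assigner)
for issue_type, count in build_report(clusters):
    print(f"{count} x {issue_type}")
```

In a real setup the `assign` callable would wrap an LLM prompt that sees the new issue and the current cluster names; keeping it as a parameter lets the clustering loop stay independent of any particular model or API.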

Videos

LLM-as-a-qualitative-judge: automating error analysis in natural language generation · underline

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques