Statistical Multicriteria Evaluation of LLM-Generated Text

Esteban Garces Arias; Hannah Blocher; Julian Rodemann; Matthias Aßenmacher; Christoph Jansen

arXiv:2506.18082·cs.CL·June 25, 2025

TL;DR

This paper introduces a statistical framework using Generalized Stochastic Dominance to evaluate LLM-generated text across multiple quality dimensions, overcoming limitations of existing single-metric and simplistic evaluation methods.

Contribution

It adapts a GSD-based framework for multi-dimensional, statistically rigorous evaluation of text quality, addressing key limitations in current benchmarking practices.

Findings

1. Effective multi-criteria evaluation of LLM outputs
2. Identification of significant differences between decoding strategies and human text
3. Framework respects different measurement scales and provides statistical guarantees

Abstract

Assessing the quality of LLM-generated text remains a fundamental challenge in natural language processing. Current evaluation approaches often rely on isolated metrics or simplistic aggregations that fail to capture the nuanced trade-offs between coherence, diversity, fluency, and other relevant indicators of text quality. In this work, we adapt a recently proposed framework for statistical inference based on Generalized Stochastic Dominance (GSD) that addresses three critical limitations in existing benchmarking methodologies: the inadequacy of single-metric evaluation, the incompatibility between cardinal automatic metrics and ordinal human judgments, and the lack of inferential statistical guarantees. The GSD-front approach enables simultaneous evaluation across multiple quality dimensions while respecting their different measurement scales, building upon partial orders of decoding…
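To make the idea of a partial order over decoding strategies concrete, here is a minimal sketch. It is not the GSD-front procedure from the paper: it uses plain componentwise (Pareto) dominance on mean scores, which assumes every metric is cardinal and omits any statistical test. The function names, strategy names, and scores are all hypothetical and chosen for illustration only.

```python
from statistics import mean


def dominates(a, b):
    """True if vector a is at least as good as b on every metric
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated(scores):
    """Return the sorted names of strategies that no other strategy dominates.

    `scores` maps a strategy name to a list of per-text score tuples.
    Averaging per metric is a simplifying assumption (cardinal scales only).
    """
    means = {
        name: tuple(mean(column) for column in zip(*rows))
        for name, rows in scores.items()
    }
    return sorted(
        name
        for name, vec in means.items()
        if not any(dominates(other_vec, vec)
                   for other, other_vec in means.items() if other != name)
    )


# Hypothetical (coherence, diversity) scores for two generated texts each.
toy_scores = {
    "greedy": [(0.9, 0.2), (0.8, 0.3)],
    "sampling": [(0.5, 0.9), (0.6, 0.8)],
    "weak": [(0.4, 0.1), (0.5, 0.2)],
}
front = non_dominated(toy_scores)
print(front)
```

In this toy data, "greedy" and "sampling" trade coherence against diversity, so neither dominates the other and the order is only partial; "weak" is worse on both metrics and drops out. According to the abstract, the GSD-based framework goes further than this sketch by respecting ordinal as well as cardinal scales and by adding inferential guarantees.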

Code & Models

Repositories

hannahblo/statistical_multicriteria_evaluation_of_llm-generated_text
Official
