On the Implications of Verbose LLM Outputs: A Case Study in Translation Evaluation

Eleftheria Briakou; Zhongtao Liu; Colin Cherry; Markus Freitag

arXiv:2410.00863·cs.CL·October 2, 2024

Open Access

TL;DR

This paper examines how verbose outputs from large language models affect translation evaluation. It shows that evaluations which ignore verbosity unfairly penalize the more verbose models, and it argues for evaluation methods that account for this behavior.

Contribution

It identifies key causes of verbosity in LLM translations and demonstrates its impact on evaluation fairness using both automatic and human assessments.

Findings

01. Verbose outputs are common across LLM translations.

02. Evaluations that ignore verbosity unfairly penalize more verbose LLMs.

03. Addressing verbosity is crucial for fair assessment.

Abstract

This paper investigates the impact of verbose LLM translations on evaluation. We first demonstrate the prevalence of this behavior across several LLM outputs drawn from the WMT 2024 general shared task on machine translation. We then identify the primary triggers of verbosity, including safety, copyright concerns, and insufficient context in short input queries. Finally, we show that ignoring this behavior unfairly penalizes more verbose LLMs according to both automatic and human evaluations, highlighting the need to address this issue for more accurate future evaluations.
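As a rough illustration of what flagging verbose translation outputs could look like, here is a minimal Python sketch. It is not the paper's method: the marker phrases, the `max_ratio` threshold, and the names `flag_verbose` and `VerbosityFlag` are all hypothetical, and whitespace tokenization is a simplification that will not work for languages written without spaces.

```python
from dataclasses import dataclass

# Hypothetical marker phrases hinting at refusals, commentary, or requests
# for more context. These are illustrative and NOT taken from the paper.
COMMENTARY_MARKERS = (
    "i cannot",
    "i can't",
    "as an ai",
    "note:",
    "alternatively",
    "could you provide",
)


@dataclass
class VerbosityFlag:
    length_ratio: float   # output words divided by source words
    too_long: bool        # True when length_ratio exceeds the threshold
    has_commentary: bool  # True when any marker phrase appears

    @property
    def is_verbose(self) -> bool:
        return self.too_long or self.has_commentary


def flag_verbose(source: str, output: str, max_ratio: float = 3.0) -> VerbosityFlag:
    """Flag an output as possibly verbose using two crude heuristics.

    max_ratio is an assumed threshold, not a value from the paper.
    """
    src_len = max(len(source.split()), 1)  # avoid division by zero
    out_len = len(output.split())
    ratio = out_len / src_len
    lowered = output.lower()
    has_commentary = any(marker in lowered for marker in COMMENTARY_MARKERS)
    return VerbosityFlag(ratio, ratio > max_ratio, has_commentary)


if __name__ == "__main__":
    plain = flag_verbose("Guten Morgen", "Good morning")
    wordy = flag_verbose(
        "Guten Morgen",
        "I cannot translate this without more context. "
        "Could you provide the full sentence?",
    )
    print(plain.is_verbose, wordy.is_verbose)
```

A heuristic like this can only surface candidates for inspection; deciding whether a flagged output is a refusal, a set of alternative translations, or added commentary would still need manual review or a stronger classifier.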

Taxonomy

Topics: Natural Language Processing Techniques