A Call for Clarity in Reporting BLEU Scores

Matt Post

arXiv:1804.08771·cs.CL·September 13, 2018

TL;DR

This paper highlights the inconsistency in BLEU score reporting in machine translation, quantifies the variation caused by different parameters, and proposes standardization and tools to improve comparability of results.

Contribution

The paper shows that BLEU scores vary with parameter choices, chiefly the tokenization and normalization applied to the reference. It advocates adopting the WMT BLEU scheme as a standard and introduces SacreBLEU for consistent evaluation.

Findings

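01. BLEU scores can differ by as much as 1.8 points between commonly used configurations.

02. Unreported or hard-to-find parameters prevent direct comparison of BLEU scores across papers.

03. Adopting the WMT BLEU scheme, which does not allow user-supplied reference processing, makes reported scores comparable.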

Abstract

The field of machine translation faces an under-recognized problem because of inconsistency in the reporting of scores from its dominant metric. Although people refer to "the" BLEU score, BLEU is in fact a parameterized metric whose values can vary wildly with changes to these parameters. These parameters are often not reported or are hard to find, and consequently, BLEU scores between papers cannot be directly compared. I quantify this variation, finding differences as high as 1.8 between commonly used configurations. The main culprit is different tokenization and normalization schemes applied to the reference. Pointing to the success of the parsing community, I suggest machine translation researchers settle upon the BLEU scheme used by the annual Conference on Machine Translation (WMT), which does not allow for user-supplied reference processing, and provide a new tool, SacreBLEU, to…
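To make the tokenization effect concrete, here is a minimal sketch of a single-sentence, single-reference BLEU computed under two tokenizers. It is a toy illustration, not SacreBLEU's implementation: the sentences, the function names, and the punctuation-splitting regex are all assumptions made for this example, and the gap it produces is unrelated to the 1.8-point figure measured in the paper.

```python
import math
import re
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, tokenize, max_n=4):
    """Unsmoothed single-sentence, single-reference BLEU (illustrative only)."""
    hyp, ref = tokenize(hypothesis), tokenize(reference)
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: each hypothesis n-gram counts at most as often
        # as it appears in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matches == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


def whitespace_tokenize(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()


def punct_tokenize(text):
    """Hypothetical tokenizer that separates punctuation from words."""
    return re.findall(r"\w+|[^\w\s]", text)


# Made-up example pair; identical strings are scored under both tokenizers.
hypothesis = "The cat sat on the mat, and it purred."
reference = "The cat sat on the mat, and then it purred."

score_ws = bleu(hypothesis, reference, whitespace_tokenize)
score_punct = bleu(hypothesis, reference, punct_tokenize)

print(f"whitespace tokenization:  {score_ws:.2f}")
print(f"punctuation tokenization: {score_punct:.2f}")
```

The system output and the reference are the same in both calls. Only the tokenizer changes, yet the two scores differ, because splitting punctuation changes both the number of tokens and which n-grams match. This is why a reported BLEU score is hard to interpret without the tokenization that produced it.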
