TL;DR
This paper shows that BLEU scores in machine translation are reported inconsistently, quantifies the variation caused by differing parameters, and proposes a standard scheme and a tool, SacreBLEU, to make results comparable.
Contribution
It identifies the variability in BLEU scores due to parameter differences, advocates for a standardized BLEU scheme, and introduces SacreBLEU for consistent evaluation.
Findings
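BLEU scores differ by as much as 1.8 between commonly used configurations
The main source of variation is the tokenization and normalization applied to the reference
Because parameters are often unreported or hard to find, BLEU scores from different papers cannot be compared directly
Adopting the WMT BLEU scheme, which does not allow user-supplied reference processing, is proposed to make scores comparable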
Abstract
The field of machine translation faces an under-recognized problem because of inconsistency in the reporting of scores from its dominant metric. Although people refer to "the" BLEU score, BLEU is in fact a parameterized metric whose values can vary wildly with changes to these parameters. These parameters are often not reported or are hard to find, and consequently, BLEU scores between papers cannot be directly compared. I quantify this variation, finding differences as high as 1.8 between commonly used configurations. The main culprit is different tokenization and normalization schemes applied to the reference. Pointing to the success of the parsing community, I suggest machine translation researchers settle upon the BLEU scheme used by the annual Conference on Machine Translation (WMT), which does not allow for user-supplied reference processing, and provide a new tool, SacreBLEU, to…
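To illustrate why tokenization of the reference matters, here is a minimal sketch in Python. It is not the paper's code and does not use SacreBLEU: the `bleu` function is a simplified single-segment, single-reference BLEU without smoothing, and the two tokenizers and the example sentences are hypothetical, chosen only to show that the same hypothesis and reference yield different scores under different tokenization. The size of the gap on this toy pair says nothing about the 1.8 figure reported on real test sets.

```python
import math
import re
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hyp_tokens, ref_tokens, max_n=4):
    """Simplified BLEU (0-100): one segment, one reference, no smoothing."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp_tokens, n)
        ref_counts = ngrams(ref_tokens, n)
        # Clipped matches: a hypothesis n-gram counts at most as often
        # as it appears in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matches == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    if len(hyp_tokens) > len(ref_tokens):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref_tokens) / len(hyp_tokens))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


def tokenize_whitespace(text):
    """Hypothetical scheme A: split on whitespace only."""
    return text.split()


def tokenize_punct(text):
    """Hypothetical scheme B: also split punctuation off from words."""
    return re.findall(r"\w+|[^\w\s]", text)


# Illustrative sentences, not taken from any test set.
hypothesis = "The cat sat on the mat, and it purred."
reference = "The cat sat on the mat, and then it purred."

score_ws = bleu(tokenize_whitespace(hypothesis), tokenize_whitespace(reference))
score_punct = bleu(tokenize_punct(hypothesis), tokenize_punct(reference))

print(f"whitespace tokenization:  {score_ws:.2f}")
print(f"punctuation tokenization: {score_punct:.2f}")
```

The system output and the reference are identical in both runs; only the tokenizer changes, yet the two scores differ. This is the sense in which a BLEU score reported without its tokenization scheme cannot be compared with one from another paper.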
