An Analysis on Automated Metrics for Evaluating Japanese-English Chat Translation
Andre Rusli, Makoto Shishido

TL;DR
This study compares traditional and neural-based metrics for evaluating Japanese-English chat translation, finding neural metrics better align with human judgment but still face challenges with complex linguistic phenomena.
Contribution
It provides a comprehensive analysis of metric performance in chat translation, highlighting the strengths and limitations of both traditional and neural-based evaluation methods.
Findings
Neural metrics outperform traditional metrics in correlating with human judgments.
All metrics rank NMT models consistently on chat translation performance.
Metrics struggle with sentences containing Japanese anaphoric zero-pronouns.
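As a rough illustration of what ranking consistency means here, the sketch below orders a few models by their system-level score under each metric and checks whether every metric yields the same order. The model names and all scores are hypothetical placeholders, not results from the paper, and `rank_models` is a helper introduced only for this example. The one metric-specific detail it relies on is that TER is an error rate, so lower is better, whereas BLEU and COMET are higher-is-better.

```python
def rank_models(scores, higher_is_better=True):
    """Return model names ordered best-first by one metric's system-level score."""
    return sorted(scores, key=scores.get, reverse=higher_is_better)


# Hypothetical system-level scores (NOT taken from the paper).
bleu = {"model_a": 18.2, "model_b": 24.5, "model_c": 21.0}
ter = {"model_a": 71.3, "model_b": 60.1, "model_c": 66.4}  # lower is better
comet = {"model_a": 0.31, "model_b": 0.52, "model_c": 0.44}

rankings = {
    "BLEU": rank_models(bleu),
    "TER": rank_models(ter, higher_is_better=False),
    "COMET": rank_models(comet),
}

# The metrics agree if every ranking is the same ordered list of models.
all_agree = len({tuple(order) for order in rankings.values()}) == 1

for metric, order in rankings.items():
    print(metric, order)
print("all metrics agree:", all_agree)
```

Agreement on system-level ranking says nothing about sentence-level agreement with human scores, which is a separate question.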
Abstract
This paper analyses how traditional baseline metrics, such as BLEU and TER, and neural-based methods, such as BERTScore and COMET, score several NMT models' performance on chat translation, and how these metrics compare with human-annotated scores. The results show that, for ranking NMT models on chat translation, all metrics seem consistent in deciding which model outperforms the others. This implies that traditional baseline metrics, which are faster and simpler to use, can still be helpful. On the other hand, when it comes to correlation with human judgment, neural-based metrics outperform traditional metrics, with COMET achieving the highest correlation with human-annotated scores on chat translation. However, we show that even the best metric struggles when scoring English translations of Japanese sentences with anaphoric zero-pronouns.
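To make "correlation with human judgment" concrete, here is a minimal sketch that computes a correlation coefficient between per-sentence metric scores and human scores. It assumes Pearson correlation, since the abstract does not say which coefficient the authors used, and all numbers are hypothetical placeholders rather than data from the paper. The names `pearson`, `metric_a`, and `metric_b` are introduced only for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-sentence scores (NOT taken from the paper).
human = [1.0, 2.0, 3.0, 4.0, 5.0]     # human-annotated quality scores
metric_a = [0.2, 0.4, 0.6, 0.8, 1.0]  # tracks the human scores linearly
metric_b = [0.5, 0.1, 0.9, 0.3, 0.7]  # only loosely related to human scores

print("metric_a vs human:", round(pearson(human, metric_a), 3))
print("metric_b vs human:", round(pearson(human, metric_b), 3))
```

Under this reading, a metric "aligns better with human judgment" when its coefficient is closer to 1. A rank-based coefficient such as Spearman or Kendall could be substituted if only the ordering of sentences matters.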
Taxonomy
Topics: Natural Language Processing Techniques
