TL;DR
This paper evaluates the effectiveness of n-gram and neural-based metrics for multilingual summarization, revealing language-dependent limitations of traditional metrics and advocating for neural evaluation methods, especially in morphologically rich languages.
Contribution
It provides a large-scale, cross-linguistic assessment of evaluation metrics, highlighting their limitations and demonstrating the advantages of neural-based metrics like COMET for diverse languages.
Findings
N-gram metrics show lower correlation with human judgments in fusional languages.
Proper tokenization improves n-gram metric performance in morphologically rich languages.
Neural-based metrics like COMET outperform traditional metrics across languages.
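The tokenization finding can be illustrated with a toy sketch. It is not the paper's setup: the unigram F1 below is a simplified ROUGE-1-style score, and `char_trigram_tokenize` is a hypothetical, crude stand-in for real morphological segmentation. The two Turkish word forms ("in our houses" / "from our houses") are chosen only for illustration.

```python
from collections import Counter


def unigram_f1(reference_tokens, candidate_tokens):
    """Simplified ROUGE-1-style F1 over token multisets (illustrative only)."""
    ref = Counter(reference_tokens)
    cand = Counter(candidate_tokens)
    overlap = sum((ref & cand).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def whitespace_tokenize(text):
    """Split on whitespace; each inflected word form is one opaque token."""
    return text.lower().split()


def char_trigram_tokenize(text):
    """Hypothetical subword tokenizer: character trigrams within each word.

    An assumption standing in for proper morphological segmentation.
    """
    tokens = []
    for word in text.lower().split():
        if len(word) <= 3:
            tokens.append(word)
        else:
            tokens.extend(word[i:i + 3] for i in range(len(word) - 2))
    return tokens


reference = "evlerimizde"   # "in our houses"
candidate = "evlerimizden"  # "from our houses"

word_score = unigram_f1(whitespace_tokenize(reference),
                        whitespace_tokenize(candidate))
subword_score = unigram_f1(char_trigram_tokenize(reference),
                           char_trigram_tokenize(candidate))
print(f"whitespace tokens: {word_score:.3f}")
print(f"subword tokens:    {subword_score:.3f}")
```

With whitespace tokens the two forms share nothing, so the score is zero even though they differ by a single case suffix. The subword variant gives credit for the shared stem and affixes.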
Abstract
Automatic n-gram-based metrics such as ROUGE are widely used for evaluating generative tasks such as summarization. While these metrics are considered indicative (even if imperfect) of human evaluation for English, their suitability for other languages remains unclear. To address this, we systematically assess evaluation metrics for generation, both n-gram-based and neural-based, to evaluate their effectiveness across languages and tasks. Specifically, we design a large-scale evaluation suite across eight languages from four typological families (agglutinative, isolating, low-fusional, and high-fusional), spanning both low- and high-resource settings, to analyze the metrics' correlation with human judgments. Our findings highlight the sensitivity of evaluation metrics to the language type. For example, in fusional languages, n-gram-based metrics show lower correlation with human assessments…
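To show what "correlation with human judgments" means operationally, here is a minimal sketch that computes a Spearman rank correlation between metric scores and human ratings. The abstract does not say which correlation coefficient the authors use, so Spearman is an assumption here. The score lists are made-up placeholders, not results from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder data: one metric score and one human rating per summary.
metric_scores = [0.21, 0.35, 0.40, 0.52, 0.60]
human_ratings = [2, 3, 3, 4, 5]
print(f"Spearman correlation: {spearman(metric_scores, human_ratings):.3f}")
```

Repeating this per language and per metric would give the kind of comparison the abstract describes, with a higher coefficient meaning the metric ranks outputs more like human annotators do.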