Evaluating the Efficacy of Summarization Evaluation across Languages
Fajri Koto, Jey Han Lau, Timothy Baldwin

TL;DR
This paper systematically assesses how well automatic summarization evaluation metrics work across eight languages, and finds that BERTScore with multilingual BERT performs well in all of them, at a level above that for English.
Contribution
It presents the first systematic attempt to quantify the panlinguistic efficacy of summarization evaluation metrics, based on manual annotation of generated summaries in eight languages, and shows that multilingual BERT within BERTScore is effective across all of them.
Findings
Of the 19 summarization evaluation metrics tested, multilingual BERT within BERTScore performs well across all eight languages.
Its performance across languages is at a level above that for English.
Generated summaries were manually annotated for focus (precision) and coverage (recall) in each language.
Abstract
While automatic summarization evaluation methods developed for English are routinely applied to other languages, this is the first attempt to systematically quantify their panlinguistic efficacy. We take a summarization corpus for eight different languages, and manually annotate generated summaries for focus (precision) and coverage (recall). Based on this, we evaluate 19 summarization evaluation metrics, and find that using multilingual BERT within BERTScore performs well across all languages, at a level above that for English.
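The distinction between focus (precision) and coverage (recall) can be illustrated with a simple unigram-overlap sketch. This is not the annotation procedure or any of the 19 metrics from the paper. The function name, the whitespace tokenization, and the example sentences are all assumptions made purely for illustration.

```python
from collections import Counter


def unigram_precision_recall(summary: str, reference: str) -> tuple[float, float]:
    """Return (precision, recall) of unigram overlap between two texts.

    Hypothetical helper for illustration only: precision stands in for
    "focus" (how much of the summary is supported by the reference) and
    recall for "coverage" (how much of the reference the summary captures).
    Tokenization is naive lowercased whitespace splitting.
    """
    summary_counts = Counter(summary.lower().split())
    reference_counts = Counter(reference.lower().split())

    # Multiset intersection: each token counts at most as often as it
    # appears in both texts.
    overlap = sum((summary_counts & reference_counts).values())

    # max(..., 1) avoids division by zero for empty inputs.
    precision = overlap / max(sum(summary_counts.values()), 1)
    recall = overlap / max(sum(reference_counts.values()), 1)
    return precision, recall


if __name__ == "__main__":
    p, r = unigram_precision_recall("the cat sat", "the cat sat on the mat")
    print(f"focus-like precision={p:.2f}, coverage-like recall={r:.2f}")
```

In this toy example the short summary contains only words found in the reference, so its precision is high, but it covers only half of the reference tokens, so its recall is lower.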
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Linear Warmup With Linear Decay · Layer Normalization · Residual Connection · WordPiece · Attention Dropout
