Revisiting Summarization Evaluation for Scientific Articles
Arman Cohan, Nazli Goharian

TL;DR
This paper critically examines the limitations of ROUGE for scientific article summarization evaluation and introduces SERA, a new relevance-based metric that correlates better with manual assessments.
Contribution
The paper reveals ROUGE's unreliability in scientific summarization and proposes SERA, a content relevance metric with higher correlation to manual scores.
Findings
ROUGE shows low reliability for scientific summaries.
Different ROUGE variants yield inconsistent correlations with manual scores.
SERA outperforms ROUGE in correlating with manual evaluation metrics.
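The findings above are stated in terms of how well a metric's scores correlate with manual scores. As a minimal sketch of that kind of comparison, the snippet below computes a Pearson correlation between per-summary metric scores and manual scores. The choice of Pearson, the function name `pearson`, and the toy numbers are all assumptions made for illustration; they are not results or the exact procedure from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)

# Toy, made-up scores for four summaries (hypothetical, not from the paper).
manual_scores = [1.0, 2.0, 3.0, 4.0]
metric_a_scores = [2.0, 4.0, 6.0, 8.0]   # ranks summaries like the manual scores
metric_b_scores = [4.0, 3.0, 2.0, 1.0]   # ranks summaries in the opposite order

corr_a = pearson(manual_scores, metric_a_scores)
corr_b = pearson(manual_scores, metric_b_scores)
```

A metric whose correlation with the manual scores is closer to 1 agrees more with human judgment on which summaries are better; in this toy setup `metric_a_scores` agrees perfectly and `metric_b_scores` disagrees perfectly.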
Abstract
Evaluation of text summarization approaches has been mostly based on metrics that measure similarities of system-generated summaries with a set of human-written gold-standard summaries. The most widely used metric in summarization evaluation has been the ROUGE family. ROUGE relies solely on lexical overlaps between the terms and phrases in the sentences; therefore, in cases of terminology variations and paraphrasing, ROUGE is not as effective. Scientific article summarization is one such case that is different from general domain summarization (e.g. newswire data). We provide an extensive analysis of ROUGE's effectiveness as an evaluation metric for scientific summarization; we show that, contrary to the common belief, ROUGE is not very reliable in evaluating scientific summaries. We furthermore show how different variants of ROUGE result in very different correlations with the manual…
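To make the lexical-overlap point concrete, here is a minimal sketch of a unigram-recall score in the spirit of ROUGE-1. It is a simplified stand-in, assuming whitespace tokenization and lowercasing with no stemming or stopword handling, and is not the official ROUGE toolkit. The function name `rouge1_recall` and the example sentences are hypothetical.

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams that also appear in the candidate.

    Simplified: lowercase + whitespace tokens, clipped counts, no stemming.
    """
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    total_ref = sum(ref_counts.values())
    if total_ref == 0:
        return 0.0
    # Clip each reference token's count by how often the candidate uses it.
    overlap = sum(min(count, cand_counts[tok]) for tok, count in ref_counts.items())
    return overlap / total_ref

# Hypothetical sentences: the paraphrase keeps the meaning but changes the terms.
reference = "the drug reduces tumor growth in mice"
verbatim = "the drug reduces tumor growth in mice"
paraphrase = "this compound inhibits cancer progression in rodents"

verbatim_score = rouge1_recall(verbatim, reference)      # every token matches
paraphrase_score = rouge1_recall(paraphrase, reference)  # only "in" matches
```

The paraphrase conveys roughly the same content as the reference but shares a single token with it, so an overlap-only score rates it far below the verbatim copy. This is the kind of terminology variation the abstract identifies as a weakness for scientific text.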
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
