Multi-Narrative Semantic Overlap Task: Evaluation and Benchmark
Naman Bansal, Mousumi Akter, Shubhra Kanti Karmaker Santu

TL;DR
This paper introduces the Multi-Narrative Semantic Overlap task, creates a benchmark dataset, and proposes a new evaluation metric, SEM-F1, which better aligns with human judgment than existing metrics.
Contribution
It defines a new NLP task, constructs a benchmark dataset with human annotations, and develops SEM-F1, a novel evaluation metric for semantic overlap.
Findings
ROUGE is unsuitable for MNSO evaluation.
SEM-F1 correlates better with human judgment.
The benchmark dataset comprises 2,925 narrative pairs and 411 human-written ground-truth overlaps.
Abstract
In this paper, we introduce an important yet relatively unexplored NLP task called Multi-Narrative Semantic Overlap (MNSO), which entails generating a Semantic Overlap of multiple alternate narratives. As no benchmark dataset is readily available for this task, we created one by crawling 2,925 narrative pairs from the web and then went through the tedious process of manually creating 411 different ground-truth semantic overlaps by engaging human annotators. As a way to evaluate this novel task, we first conducted a systematic study by borrowing the popular ROUGE metric from text-summarization literature and discovered that ROUGE is not suitable for our task. Subsequently, we conducted further human annotations/validations to create 200 document-level and 1,518 sentence-level ground-truth labels which helped us formulate a new precision-recall style evaluation metric, called SEM-F1…
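To make "precision-recall style" concrete, here is a minimal sketch of a sentence-level semantic precision, recall, and F1 computation. It is an illustration only and not necessarily how the authors define SEM-F1: the abstract above is truncated before any formula, so the max-over-sentences aggregation is an assumption, and `bow_cosine` is a hypothetical toy similarity (cosine over word counts) standing in for whatever sentence-similarity model a real metric would use. The names `bow_cosine` and `semantic_prf` are invented for this example.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Toy stand-in for a sentence-similarity model: cosine over word counts.

    ASSUMPTION: a real semantic metric would use learned sentence embeddings;
    this function only keeps the sketch dependency-free.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_prf(candidate: list[str], reference: list[str], sim=bow_cosine):
    """Return (precision, recall, f1) over two lists of sentences.

    ASSUMPTION: precision averages, over candidate sentences, the best match
    in the reference; recall does the same in the other direction.
    """
    if not candidate or not reference:
        return 0.0, 0.0, 0.0
    precision = sum(max(sim(c, r) for r in reference) for c in candidate) / len(candidate)
    recall = sum(max(sim(r, c) for c in candidate) for r in reference) / len(reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    generated = ["the senate passed the bill"]
    gold = ["the senate passed the bill", "the president signed it"]
    print(semantic_prf(generated, gold))
```

In this usage example the generated overlap matches one gold sentence exactly but misses the other, so precision is high while recall is lower. A score built this way separates "said something wrong" from "left something out", which a single overlap score does not.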
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
