SPICED: News Similarity Detection Dataset with Multiple Topics and Complexity Levels
Elena Shushkevich, Long Mai, Manuel V. Loureiro, Steven Derby, Tri Kurniawan Wijaya

TL;DR
The paper introduces SPICED, a comprehensive news similarity dataset across multiple topics and complexity levels, designed to improve the training of models in detecting redundant news content.
Contribution
It presents a novel multi-topic, multi-level news similarity dataset and benchmarks several models on this dataset, addressing the lack of topic-specific datasets.
Findings
Models show varied performance across topics and complexity levels.
Topic segmentation improves model training effectiveness.
Benchmark results highlight strengths and weaknesses of different models.
Abstract
The proliferation of news media outlets has increased the demand for intelligent systems capable of detecting redundant information in news articles in order to enhance user experience. However, the heterogeneous nature of news can lead to spurious findings in these systems: simple heuristics, such as whether both articles in a pair are about politics, can provide strong but deceptive downstream performance. Segmenting news similarity datasets into topics improves the training of these models by forcing them to learn how to distinguish salient characteristics within narrower domains. However, this requires the existence of topic-specific datasets, which are currently lacking. In this article, we propose a novel dataset of similar news, SPICED, which includes seven topics: Crime & Law, Culture & Entertainment, Disasters & Accidents, Economy & Business, Politics & Conflicts, Science &…
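The abstract's point about deceptive heuristics can be illustrated with a minimal sketch. The pairs below are hand-made toy data, not drawn from SPICED, and the function names are hypothetical. The sketch assumes a heuristic that labels two articles as similar whenever they share a topic, then compares its accuracy on a mixed-topic set with its accuracy inside each topic.

```python
from collections import defaultdict

# Toy, hand-made pairs for illustration only (NOT taken from SPICED).
# Each entry is (topic_of_article_a, topic_of_article_b, is_similar).
PAIRS = [
    ("politics", "politics", True),
    ("politics", "politics", False),
    ("sports", "sports", True),
    ("sports", "sports", False),
    ("politics", "sports", False),
    ("sports", "politics", False),
    ("politics", "economy", False),
    ("economy", "sports", False),
    ("sports", "economy", False),
    ("economy", "politics", False),
]


def topic_match_heuristic(topic_a, topic_b):
    """Predict 'similar' whenever both articles share a topic."""
    return topic_a == topic_b


def accuracy(pairs):
    """Fraction of pairs where the heuristic agrees with the label."""
    correct = sum(topic_match_heuristic(a, b) == label for a, b, label in pairs)
    return correct / len(pairs)


def accuracy_by_topic(pairs):
    """Accuracy restricted to same-topic pairs, grouped by topic."""
    groups = defaultdict(list)
    for a, b, label in pairs:
        if a == b:
            groups[a].append((a, b, label))
    return {topic: accuracy(group) for topic, group in groups.items()}


mixed_accuracy = accuracy(PAIRS)
per_topic_accuracy = accuracy_by_topic(PAIRS)
print(mixed_accuracy)
print(per_topic_accuracy)
```

On this toy set the heuristic looks strong on the mixed pairs only because cross-topic pairs are easy negatives. Within a single topic it predicts "similar" for every pair, so it carries no signal, which is the motivation the abstract gives for topic-specific evaluation.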
Taxonomy
Topics: Computational and Text Analysis Methods · Topic Modeling · Advanced Text Analysis Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · WordPiece · Dense Connections · Linear Layer · Softmax · Residual Connection · Attention Dropout · Dropout
