TAP-DLND 1.0 : A Corpus for Document Level Novelty Detection
Tirthankar Ghosal, Amitra Salam, Swati Tiwari, Asif Ekbal, Pushpak Bhattacharyya

TL;DR
This paper introduces TAP-DLND 1.0, a new annotated corpus for benchmarking document-level novelty detection, addressing a key gap in evaluation resources for NLP applications like summarization and news tracking.
Contribution
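It provides what the authors describe as the first benchmark corpus for evaluating document-level novelty detection in a classification framework, built by periodic, event-specific crawling of news documents across several domains, and demonstrates its use with a system developed for the task.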
Findings
Created a large, annotated news corpus for novelty detection
Demonstrated the corpus's utility with a baseline system
Provided statistical insights into the dataset
Abstract
Detecting the novelty of an entire document is an Artificial Intelligence (AI) frontier problem with widespread NLP applications, such as extractive document summarization, tracking the development of news events, and predicting the impact of scholarly articles. Important though the problem is, we are unaware of any document-level benchmark data that correctly addresses the evaluation of automatic novelty detection techniques in a classification framework. To bridge this gap, we present a resource for benchmarking techniques for document-level novelty detection. We create the resource via periodic, event-specific crawling of news documents across several domains. We release the annotated corpus with the necessary statistics and show its use with a system developed for the problem.
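The classification framing mentioned in the abstract can be illustrated with a toy example. The sketch below is not the authors' system, which the abstract does not describe. It assumes the task pairs a target document with previously seen documents on the same event, and it labels the target as novel when its highest bag-of-words cosine similarity to any of them falls below a threshold. The function names and the threshold value of 0.5 are hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_novel(target, sources, threshold=0.5):
    """Return True if no source document is too similar to the target.

    The threshold is an arbitrary illustrative value, not taken from the paper.
    """
    target_vec = bag_of_words(target)
    max_sim = max(
        (cosine(target_vec, bag_of_words(s)) for s in sources),
        default=0.0,  # with no prior documents, everything counts as novel
    )
    return max_sim < threshold


if __name__ == "__main__":
    seen = ["A strong earthquake struck the city on Monday."]
    print(is_novel("A strong earthquake struck the city on Monday.", seen))
    print(is_novel("Rescue teams reached remote villages after the tremor.", seen))
```

A lexical-overlap threshold like this only captures surface similarity. An annotated corpus makes it possible to measure how far such a heuristic falls short of human novelty judgments.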
Taxonomy
Topics: Topic Modeling · Advanced Text Analysis Techniques · Software Engineering Research
