An Improved System for Sentence-level Novelty Detection in Textual Streams

Xinyu Fu; Eugene Ch'ng; Uwe Aickelin; Lanyun Zhang

arXiv:1605.00122·cs.IR·June 6, 2017

TL;DR

This paper introduces an event detection system that combines incremental TF-IDF weighting with Locality Sensitive Hashing (LSH) for sentence-level novelty detection in large, unpredictable textual streams. On a Google News benchmark dataset, it improves miss probability by approximately 16% over a recognised baseline system.

Contribution

The paper presents an event detection system that continually updates the vector space model using incremental TF-IDF weighting combined with LSH, and that outperforms a recognised baseline system in miss probability.

Findings

01 Outperforms the baseline by ~16% in miss probability.

02 Efficiently adapts to new terms in large data streams.

03 Effective in dynamic, unpredictable news data environments.

Abstract

Novelty detection in news events has long been a difficult problem. A number of models have performed well on specific data streams, but certain issues are far from solved, particularly in large data streams from the WWW, where the unpredictability of new terms requires the vector space model to adapt. We present a novel event detection system based on Incremental Term Frequency-Inverse Document Frequency (TF-IDF) weighting combined with Locality Sensitive Hashing (LSH). Our system adapts efficiently and effectively to new terms appearing in the data stream by continually updating the vector space model. In terms of miss probability, our proposed novelty detection framework outperforms a recognised baseline system by approximately 16% when evaluated on a benchmark dataset from Google News.
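
To make the idea in the abstract concrete, the following is a minimal sketch of how incremental TF-IDF and LSH might be combined for streaming novelty detection. It is not the authors' implementation: the class names (`IncrementalTfidf`, `HyperplaneLSH`, `NoveltyDetector`), the smoothed IDF formula, the random-hyperplane hash family, whitespace tokenisation, and the 0.5 similarity threshold are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter, defaultdict


class IncrementalTfidf:
    """Document frequencies are updated as each sentence arrives (assumed scheme)."""

    def __init__(self):
        self.n_docs = 0
        self.doc_freq = Counter()

    def add(self, tokens):
        self.n_docs += 1
        self.doc_freq.update(set(tokens))  # count each term once per sentence

    def vector(self, tokens):
        # Assumed smoothed IDF: log(1 + N / df), always positive.
        tf = Counter(tokens)
        vec = {t: c * math.log(1 + self.n_docs / self.doc_freq[t])
               for t, c in tf.items() if self.doc_freq[t] > 0}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}


class HyperplaneLSH:
    """Random-hyperplane LSH over sparse vectors with an open vocabulary."""

    def __init__(self, n_bits=8, seed=0):
        self.n_bits = n_bits
        self.rng = random.Random(seed)
        self.term_planes = {}             # term -> n_bits Gaussian coefficients
        self.buckets = defaultdict(list)  # signature -> stored vectors

    def _planes(self, term):
        # Coefficients for unseen terms are drawn lazily, so new terms need no rebuild.
        if term not in self.term_planes:
            self.term_planes[term] = [self.rng.gauss(0, 1) for _ in range(self.n_bits)]
        return self.term_planes[term]

    def signature(self, vec):
        sums = [0.0] * self.n_bits
        for term, weight in vec.items():
            for i, coeff in enumerate(self._planes(term)):
                sums[i] += weight * coeff
        return tuple(s >= 0 for s in sums)


def cosine(a, b):
    # Both vectors are unit-normalised, so the dot product is the cosine.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


class NoveltyDetector:
    def __init__(self, threshold=0.5, n_bits=8, seed=0):
        self.tfidf = IncrementalTfidf()
        self.lsh = HyperplaneLSH(n_bits, seed)
        self.threshold = threshold  # hypothetical value, not from the paper

    def process(self, sentence):
        """Return True if the sentence is judged novel, then store it."""
        tokens = sentence.lower().split()
        self.tfidf.add(tokens)
        vec = self.tfidf.vector(tokens)
        bucket = self.lsh.buckets[self.lsh.signature(vec)]
        # Only sentences hashed to the same bucket are compared.
        best = max((cosine(vec, other) for other in bucket), default=0.0)
        bucket.append(vec)
        return best < self.threshold


detector = NoveltyDetector()
for s in ["stocks fall sharply", "stocks fall sharply", "earthquake hits coastal town"]:
    print(s, "->", "novel" if detector.process(s) else "seen")
```

In this sketch, stored vectors keep the IDF weights they had on arrival, so comparisons against older sentences are approximate. A single hash table is also used for brevity; using several tables with independent hyperplanes would be a natural way to reduce missed neighbours, at the cost of more comparisons.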
