Parameterizing Kterm Hashing

Dominik Wurzer; Yumeng Qin

arXiv:2208.01340·cs.IR·August 3, 2022

TL;DR

This paper enhances Kterm Hashing for novelty detection in data streams by introducing parameterized weights for kterms, leading to improved detection accuracy over the traditional uniform weighting method.

Contribution

It proposes a parameterization of Kterm Hashing that assigns weights to kterms based on their characteristics, instead of treating all kterms as equally important when scoring a document's novelty.

Findings

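1. Parameterized Kterm Hashing can surpass uniform kterm weighting in detection effectiveness.

2. The improvement is demonstrated in a First Story Detection setting.

3. The work builds on earlier research that scaled Kterm Hashing to Twitter-size data streams without sacrificing detection accuracy.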

Abstract

Kterm Hashing provides an innovative approach to novelty detection on massive data streams. Previous research focused on maximizing the efficiency of Kterm Hashing and succeeded in scaling First Story Detection to Twitter-size data streams without sacrificing detection accuracy. In this paper, we focus on improving the effectiveness of Kterm Hashing. Traditionally, all kterms are considered equally important when calculating a document's degree of novelty with respect to the past. We believe that certain kterms are more important than others and hypothesize that uniform kterm weights are sub-optimal for determining novelty in data streams. To validate our hypothesis, we parameterize Kterm Hashing by assigning weights to kterms based on their characteristics. Our experiments apply Kterm Hashing in a First Story Detection setting and reveal that parameterized Kterm Hashing can surpass…
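
The contrast between uniform and parameterized kterm weights can be illustrated with a small sketch. This is not the authors' implementation. It assumes that a kterm is an unordered combination of up to k distinct terms from a document, and that novelty is the weighted share of a document's kterms that have not been seen before. For simplicity it keeps past kterms in an exact Python set rather than a fixed-size hash table. The function names and the length-based weight in the usage lines are hypothetical; the abstract does not state which kterm characteristics the paper uses for weighting.

```python
from itertools import combinations


def kterms(tokens, k=2):
    """Return all unordered combinations of 1..k distinct terms (assumed kterm definition)."""
    terms = sorted(set(tokens))
    return [combo for n in range(1, k + 1) for combo in combinations(terms, n)]


def novelty(tokens, memory, weight=lambda kt: 1.0, k=2):
    """Weighted share of the document's kterms that are absent from memory.

    With the default weight every kterm counts equally (uniform weighting).
    Passing another weight function parameterizes the score.
    """
    kts = kterms(tokens, k)
    total = sum(weight(kt) for kt in kts)
    if total == 0:
        return 0.0  # empty document: nothing to score
    unseen = sum(weight(kt) for kt in kts if kt not in memory)
    return unseen / total


def observe(tokens, memory, k=2):
    """Add the document's kterms to memory so later documents are compared against them."""
    memory.update(kterms(tokens, k))


# Usage: one past document, then score a new one under both weightings.
memory = set()
observe(["earthquake", "hits", "city"], memory)
new_doc = ["earthquake", "hits", "coast"]
uniform_score = novelty(new_doc, memory)               # every kterm weighs 1.0
weighted_score = novelty(new_doc, memory, weight=len)  # hypothetical: longer kterms weigh more
```

In this toy case the new document shares three of its six kterms with the past. Uniform weighting scores it at 0.5, while the hypothetical length-based weight raises the score to 5/9 because two of the three unseen kterms are pairs. A weight function could instead depend on any other kterm characteristic.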
