SimDoc: Topic Sequence Alignment based Document Similarity Framework

Gaurav Maheshwari; Priyansh Trivedi; Harshita Sahijwani; Kunal Jha; Sourish Dasgupta; Jens Lehmann

arXiv:1611.04822 · cs.CL · November 15, 2017 · 2 cites

TL;DR

SimDoc introduces a novel framework that models documents as topic sequences and uses sequence alignment to accurately measure semantic similarity, outperforming traditional bag-of-words methods in clustering tasks.

Contribution

The paper presents a new semantic similarity framework based on topic-sequence modeling and sequence alignment, capturing thematic flow often ignored by existing methods.

Findings

01 SimDoc outperforms bag-of-words techniques in accuracy

02 Effective in document clustering applications

03 Introduces a novel topic-topic similarity measure

Abstract

Document similarity is the problem of estimating the degree to which a given pair of documents has similar semantic content. An accurate document similarity measure can improve several enterprise-relevant tasks such as document clustering, text mining, and question answering. In this paper, we show that a document's thematic flow, which is often disregarded by bag-of-words techniques, is pivotal in estimating document similarity. To this end, we propose a novel semantic document similarity framework, called SimDoc. We model documents as topic sequences, where topics represent latent generative clusters of related words. Then, we use a sequence alignment algorithm to estimate their semantic similarity. We further conceptualize a novel mechanism to compute topic-topic similarity to fine-tune our system. In our experiments, we show that SimDoc outperforms many contemporary bag-of-words…
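
The abstract describes aligning topic sequences but does not say which alignment algorithm or scoring scheme is used. The sketch below therefore assumes a Smith-Waterman-style local alignment, purely to illustrate why topic order can matter. The names `topic_sim`, `alignment_similarity`, the gap penalty, and the toy topic ids are all hypothetical and are not taken from the paper. The binary `topic_sim` stands in for the paper's topic-topic similarity measure, which is not specified here.

```python
from typing import Callable, Sequence


def topic_sim(t1: int, t2: int) -> float:
    """Placeholder topic-topic score (assumption): +1 for equal topics, -1 otherwise."""
    return 1.0 if t1 == t2 else -1.0


def alignment_similarity(
    a: Sequence[int],
    b: Sequence[int],
    sim: Callable[[int, int], float] = topic_sim,
    gap: float = 0.5,
) -> float:
    """Local-alignment score of two topic sequences, normalized to [0, 1].

    Assumes sim() returns at most 1.0, so the best possible raw score
    is the length of the shorter sequence.
    """
    if not a or not b:
        return 0.0
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(
                0.0,                                      # start a new alignment
                H[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),  # align two topics
                H[i - 1][j] - gap,                        # skip a topic in a
                H[i][j - 1] - gap,                        # skip a topic in b
            )
            best = max(best, H[i][j])
    return best / min(len(a), len(b))


# Toy documents as sequences of topic ids (made-up data).
doc_a = [0, 2, 2, 1]
doc_b = [0, 2, 1]   # same topics as doc_c, in the same order as doc_a
doc_c = [1, 2, 0]   # same topics as doc_b, in reversed order

print(alignment_similarity(doc_a, doc_b))
print(alignment_similarity(doc_a, doc_c))
```

Here `doc_b` and `doc_c` contain exactly the same set of topics, so an order-insensitive comparison cannot tell them apart, while the alignment score ranks `doc_b` as closer to `doc_a` because its topics occur in the same order.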

Taxonomy

Topics: Topic Modeling · Advanced Text Analysis Techniques · Natural Language Processing Techniques