My Approach = Your Apparatus? Entropy-Based Topic Modeling on Multiple Domain-Specific Text Collections

Julian Risch; Ralf Krestel

arXiv:1911.11240·cs.IR·November 27, 2019·1 cites

TL;DR

This paper introduces an entropy-based cross-collection topic model that effectively distinguishes domain-specific and general words, improving coherence, perplexity, and classification accuracy across diverse text collections.

Contribution

The paper presents an entropy-based cross-collection topic model that separates collection-specific words from collection-independent words, improving interpretability and performance in multi-collection text analysis.

Findings

1. Achieves up to 13% higher topic coherence
2. Achieves up to 4% lower perplexity
3. Achieves up to 31% higher classification accuracy

Abstract

Comparative text mining extends from genre analysis and political bias detection to the revelation of cultural and geographic differences, through to the search for prior art across patents and scientific papers. These applications use cross-collection topic modeling for the exploration, clustering, and comparison of large sets of documents, such as digital libraries. However, topic modeling on documents from different collections is challenging because of domain-specific vocabulary. We present a cross-collection topic model combined with automatic domain term extraction and phrase segmentation. This model distinguishes collection-specific and collection-independent words based on information entropy and reveals commonalities and differences of multiple text collections. We evaluate our model on patents, scientific papers, newspaper articles, forum posts, and Wikipedia articles. In…
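The idea of separating collection-specific from collection-independent words by information entropy can be illustrated with a minimal Python sketch. This is not the authors' model: the function name `normalized_entropy`, the toy token lists, and the choice to normalize by relative frequency per collection are all assumptions made for illustration. A word spread evenly across collections scores near 1 (collection-independent), while a word concentrated in one collection scores near 0 (collection-specific).

```python
import math


def normalized_entropy(collections, word):
    """Normalized Shannon entropy of a word's distribution over collections.

    `collections` maps a collection name to a list of tokens (hypothetical
    input format). Relative frequencies are used so that large collections
    do not dominate the distribution.
    """
    if len(collections) < 2:
        raise ValueError("need at least two collections to compare")

    # Relative frequency of the word within each collection.
    rel_freqs = [
        tokens.count(word) / len(tokens)
        for tokens in collections.values()
        if tokens
    ]
    total = sum(rel_freqs)
    if total == 0:
        raise ValueError(f"word {word!r} does not occur in any collection")

    # Turn the relative frequencies into a probability distribution
    # over collections, dropping zero entries (0 * log 0 is taken as 0).
    probs = [f / total for f in rel_freqs if f > 0]
    entropy = -sum(p * math.log(p) for p in probs)

    # Divide by the maximum possible entropy, log(number of collections),
    # so the score lies in [0, 1]. abs() avoids returning -0.0.
    return abs(entropy / math.log(len(collections)))


# Toy data, invented for this example only.
collections = {
    "patents": ["apparatus", "method", "apparatus", "claim"],
    "papers": ["approach", "method", "approach", "result"],
}

for w in ("method", "apparatus", "approach"):
    print(w, normalized_entropy(collections, w))
```

In this toy data, "method" occurs equally often in both collections and receives the maximum score, whereas "apparatus" and "approach" each occur in only one collection and receive a score of 0. Any cutoff between the two classes would be a tuning choice that this sketch does not attempt to set.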

Code & Models

Repositories

julian-risch/JCDL2018
Official

Taxonomy

Topics: Topic Modeling · Advanced Text Analysis Techniques · Text and Document Classification Technologies