Scalable Text Mining with Sparse Generative Models

Antti Puurula

arXiv:1602.02332 · cs.IR · February 9, 2016 · 1 citation

Open Access

TL;DR

This thesis combines generative text models with sparse computation to make text mining scalable, unifying formally equivalent models that were developed separately for tasks such as retrieval and classification.

Contribution

It presents a unifying formalization of generative text models and introduces sparse computation techniques, enabling scalable and effective text mining across multiple tasks.
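To make "generative model combined with sparse computation" concrete, here is a minimal sketch of one well-known instance: a multinomial naive Bayes classifier whose parameters are stored in an inverted index, so that scoring a document only visits the (word, class) pairs that co-occurred in training. This is an illustration under stated assumptions, not the thesis's own implementation: the function names `train` and `classify`, the Laplace smoothing parameter `alpha`, and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(docs, labels, alpha=1.0):
    """Fit multinomial naive Bayes; keep only nonzero (word, class) entries."""
    class_docs = Counter(labels)
    counts = defaultdict(Counter)              # class -> word -> count
    for words, label in zip(docs, labels):
        counts[label].update(words)
    vocab = {w for wc in counts.values() for w in wc}
    total_docs = sum(class_docs.values())

    base = {}                                  # class -> (log prior, log p(unseen word | class))
    index = defaultdict(list)                  # word -> [(class, correction)]
    for c, wc in counts.items():
        denom = sum(wc.values()) + alpha * len(vocab)
        base[c] = (math.log(class_docs[c] / total_docs), math.log(alpha / denom))
        for w, n in wc.items():
            # log((n + alpha) / denom) - log(alpha / denom) = log((n + alpha) / alpha)
            index[w].append((c, math.log((n + alpha) / alpha)))
    return base, index

def classify(words, base, index):
    """Score classes by starting from the smoothed baseline, then correcting sparsely."""
    tf = Counter(w for w in words if w in index)   # ignore out-of-vocabulary words
    length = sum(tf.values())
    # Baseline: pretend every word is unseen in every class.
    scores = {c: prior + length * unseen for c, (prior, unseen) in base.items()}
    # Sparse pass: touch only classes that actually contain each word.
    for w, k in tf.items():
        for c, delta in index[w]:
            scores[c] += k * delta
    return max(scores, key=scores.get)

# Hypothetical toy data, for illustration only.
docs = [["cheap", "pills", "cheap"], ["meeting", "agenda"],
        ["cheap", "offer"], ["project", "meeting"]]
labels = ["spam", "ham", "spam", "ham"]
base, index = train(docs, labels)
label = classify(["cheap", "meeting", "cheap"], base, index)
```

The design point is that the per-class baseline is computed once per document, and the remaining work scales with the number of index postings for the document's words rather than with the number of classes times the vocabulary size.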

Findings

01 Matches or outperforms leading task-specific methods

02 Reduces classification times by an order of magnitude

03 Achieved top positions in Kaggle competitions

Abstract

The information age has brought a deluge of data. Much of this is in text form, insurmountable in scope for humans and incomprehensible in structure for computers. Text mining is an expanding field of research that seeks to utilize the information contained in vast document collections. General data mining methods based on machine learning face challenges with the scale of text data, posing a need for scalable text mining methods. This thesis proposes a solution to scalable text mining: generative models combined with sparse computation. A unifying formalization for generative text models is defined, bringing together research traditions that have used formally equivalent models, but ignored parallel developments. This framework allows the use of methods developed in different processing tasks such as retrieval and classification, yielding effective solutions across different text…

Taxonomy

Topics: Advanced Text Analysis Techniques · Topic Modeling · Text and Document Classification Technologies