Conducting sparse feature selection on arbitrarily long phrases in text corpora with a focus on interpretability

Luke Miratrix; Robin Ackerman

arXiv:1511.06798 · cs.CL · July 26, 2016

TL;DR

This paper introduces a sparse classification framework for topic-specific summarization of large text corpora. It selects a small set of predictive phrases of arbitrary length, positioning itself between simple word-frequency methods and heavier model-based approaches such as Latent Dirichlet Allocation (LDA).

Contribution

It presents a novel, scalable approach for phrase-based summarization using sparse regression with branch-and-bound, applicable to diverse text analysis contexts.

Findings

01 The method efficiently identifies relevant phrases of arbitrary length.

02 Compared to existing methods, it offers faster computation and more interpretable summaries.

03 The approach is demonstrated with real-world datasets and a new R package, textreg.

Abstract

We propose a general framework for topic-specific summarization of large text corpora, and illustrate how it can be used for analysis in two quite different contexts: an OSHA database of fatality and catastrophe reports (to facilitate surveillance for patterns in circumstances leading to injury or death) and legal decisions on workers' compensation claims (to explore relevant case law). Our summarization framework, built on sparse classification methods, is a compromise between simple word frequency based methods currently in wide use, and more heavyweight, model-intensive methods such as Latent Dirichlet Allocation (LDA). For a particular topic of interest (e.g., mental health disability, or chemical reactions), we regress a labeling of documents onto the high-dimensional counts of all the other words and phrases in the documents. The resulting small set of phrases found as predictive…
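The regression step described in the abstract can be illustrated with a small sketch. It is not the authors' algorithm and not the textreg API: it assumes an L1-penalized logistic loss fitted by proximal gradient descent in place of the paper's branch-and-bound search, caps phrases at three words rather than allowing arbitrary length, and uses an invented four-document corpus. All function names and parameter values (`lam`, `step`, `iters`) are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, max_n):
    """All contiguous phrases of 1..max_n tokens, shortest first."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_counts(docs, max_n=3):
    """Per-document phrase counts plus the sorted phrase vocabulary."""
    rows = [Counter(ngrams(doc.lower().split(), max_n)) for doc in docs]
    vocab = sorted(set().union(*rows))
    return rows, vocab


def soft_threshold(x, t):
    """Shrink x toward zero by t; this is what produces exact zeros."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def fit_sparse_logistic(rows, vocab, labels, lam=0.2, step=0.2, iters=500):
    """L1-penalized logistic regression via proximal gradient descent.

    Assumption: a stand-in for the paper's solver, chosen for brevity.
    Labels are 1 (document is on topic) or 0 (off topic).
    """
    weights = dict.fromkeys(vocab, 0.0)
    intercept = 0.0
    n = len(rows)
    for _ in range(iters):
        grad = dict.fromkeys(vocab, 0.0)
        grad_b = 0.0
        for row, y in zip(rows, labels):
            z = intercept + sum(weights[p] * c for p, c in row.items())
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_b += err / n
            for p, c in row.items():
                grad[p] += err * c / n
        intercept -= step * grad_b  # the intercept is not penalized
        for p in vocab:
            weights[p] = soft_threshold(weights[p] - step * grad[p], step * lam)
    selected = {p: w for p, w in weights.items() if w != 0.0}
    return selected, intercept


# Invented toy corpus; 1 marks the topic of interest.
docs = [
    "worker exposed to chemical reaction in tank",
    "chemical reaction caused fire in plant",
    "worker fell from ladder at site",
    "worker struck by forklift at site",
]
labels = [1, 1, 0, 0]

rows, vocab = build_counts(docs)
selected, intercept = fit_sparse_logistic(rows, vocab, labels)
for phrase, weight in sorted(selected.items(), key=lambda kv: -kv[1]):
    print(f"{phrase!r}: {weight:+.3f}")
```

The phrases left with nonzero weight play the role of the summary: positive weights mark phrases associated with the labeled topic, negative weights mark phrases associated with its absence. Raising `lam` shortens the list.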
