Conducting sparse feature selection on arbitrarily long phrases in text corpora with a focus on interpretability
Luke Miratrix, Robin Ackerman

TL;DR
This paper introduces a sparse classification framework that summarizes large text corpora by selecting a small set of key phrases of arbitrary length. It is positioned between simple word-frequency methods and heavier models such as Latent Dirichlet Allocation (LDA).
Contribution
It presents a scalable approach to phrase-based summarization that combines sparse regression with branch-and-bound search and applies to diverse text analysis contexts.
Findings
The method efficiently identifies relevant phrases of arbitrary length.
Compared to existing methods, it offers faster computation and more interpretable summaries.
The approach is demonstrated with real-world datasets and a new R package, textreg.
Abstract
We propose a general framework for topic-specific summarization of large text corpora, and illustrate how it can be used for analysis in two quite different contexts: an OSHA database of fatality and catastrophe reports (to facilitate surveillance for patterns in circumstances leading to injury or death) and legal decisions on workers' compensation claims (to explore relevant case law). Our summarization framework, built on sparse classification methods, is a compromise between simple word frequency based methods currently in wide use, and more heavyweight, model-intensive methods such as Latent Dirichlet Allocation (LDA). For a particular topic of interest (e.g., mental health disability, or chemical reactions), we regress a labeling of documents onto the high-dimensional counts of all the other words and phrases in the documents. The resulting small set of phrases found as predictive…
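The regression step described in the abstract (a document labeling regressed onto phrase counts, with a sparsity penalty so that only a few phrases survive) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation or the textreg API: the function names, the toy documents, the squared-error loss, and the coordinate-descent solver are all assumptions made for the example. In particular, it enumerates every phrase up to a fixed max_len, whereas the paper relies on branch-and-bound to handle phrases of arbitrary length without full enumeration.

```python
from collections import Counter

def ngrams(tokens, max_len):
    """Return every contiguous phrase of 1..max_len tokens, shortest first."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(tokens) - n + 1)]

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; this step is what yields exact zeros."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def select_phrases(docs, labels, lam=0.1, max_len=3, sweeps=100):
    """L1-penalized least squares of +1/-1 labels on phrase counts.

    Hypothetical helper for illustration only: plain coordinate descent
    over a fully enumerated phrase vocabulary, with no branch-and-bound.
    """
    counts = [Counter(ngrams(d.lower().split(), max_len)) for d in docs]
    vocab = sorted({p for c in counts for p in c})
    n = len(docs)
    # One count column per phrase (a Counter returns 0 for absent phrases).
    cols = {p: [c[p] for c in counts] for p in vocab}
    beta = dict.fromkeys(vocab, 0.0)
    intercept = sum(labels) / n
    resid = [y - intercept for y in labels]
    for _ in range(sweeps):
        for p in vocab:
            x = cols[p]
            norm = sum(v * v for v in x) / n
            # Covariance of this phrase with the residual, with the
            # phrase's own current contribution added back in.
            rho = sum(v * (r + v * beta[p]) for v, r in zip(x, resid)) / n
            new = soft_threshold(rho, lam) / norm
            if new != beta[p]:
                delta = new - beta[p]
                resid = [r - delta * v for r, v in zip(resid, x)]
                beta[p] = new
        # Refit the unpenalized intercept after each sweep.
        shift = sum(resid) / n
        intercept += shift
        resid = [r - shift for r in resid]
    # Only phrases with a nonzero weight form the summary.
    return {p: b for p, b in beta.items() if abs(b) > 1e-8}

# Toy corpus (invented for the example): +1 marks the topic of interest.
docs = [
    "worker exposed to carbon monoxide",
    "carbon monoxide leak in plant",
    "worker fell from ladder",
    "ladder collapsed in plant",
]
labels = [1, 1, -1, -1]
summary = select_phrases(docs, labels, lam=0.1)
```

Raising lam shrinks the summary toward the empty set, and lowering it admits more phrases, so the penalty acts as the knob for summary length. When several phrases have identical count columns (here "carbon", "monoxide", and "carbon monoxide"), this simple solver keeps whichever it visits first; a real implementation would need a deliberate rule for choosing among such ties.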
