Concise comparative summaries (CCS) of large text corpora with a human experiment
Jinzhu Jia, Luke Miratrix, Bin Yu, Brian Gawalt, Laurent El Ghaoui, Luke Barnesmoore, Sophie Clavier

TL;DR
This paper introduces a flexible, sparse classification-based framework called CCS for topic-specific summarization of large text corpora, validated through human surveys and case studies on news articles.
Contribution
The paper presents CCS, a novel lightweight summarization method using sparse classification, bridging simple frequency methods and complex models like LDA, with validation through human experiments.
Findings
Lasso with L2 normalization effectively summarizes large corpora.
CCS provides meaningful summaries comparable to human understanding.
Case studies demonstrate CCS's utility in media analysis and coverage comparison.
Abstract
In this paper we propose a general framework for topic-specific summarization of large text corpora and illustrate how it can be used for the analysis of news databases. Our framework, concise comparative summarization (CCS), is built on sparse classification methods. CCS is a lightweight and flexible tool that offers a compromise between simple word-frequency-based methods currently in wide use and more heavyweight, model-intensive methods such as latent Dirichlet allocation (LDA). We argue that sparse methods have much to offer for text analysis and hope CCS opens the door for a new branch of research in this important field. For a particular topic of interest (e.g., China or energy), CCS automatically labels documents as being either on- or off-topic (usually via keyword search), and then uses sparse classification methods to predict these labels with the high-dimensional counts of…
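The pipeline described in the abstract (label documents by keyword search, then fit a sparse classifier on word counts and read the selected words as the summary) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy corpus, the function names `soft_threshold` and `ccs_summary`, the penalty value `lam`, and the choice of a squared-error Lasso on ±1 labels solved by coordinate descent are all assumptions made for illustration. The L2 rescaling of each word column follows the "Lasso with L2 normalization" finding listed above.

```python
import math
from collections import Counter


def soft_threshold(value, lam):
    """Shrink value toward zero by lam (the Lasso proximal step)."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def ccs_summary(docs, keyword, lam=0.1, n_sweeps=200):
    """Hypothetical helper: return (word, weight) pairs with positive Lasso weights."""
    tokenized = [doc.lower().split() for doc in docs]
    # Step 1: label documents by keyword search (+1 on-topic, -1 off-topic).
    y = [1.0 if keyword in toks else -1.0 for toks in tokenized]
    counts = [Counter(toks) for toks in tokenized]
    # The keyword itself is excluded so it cannot trivially predict the label.
    vocab = sorted({w for toks in tokenized for w in toks if w != keyword})
    # Step 2: build word-count columns and rescale each to unit L2 norm.
    columns = []
    for word in vocab:
        col = [float(c[word]) for c in counts]
        norm = math.sqrt(sum(v * v for v in col))
        columns.append([v / norm for v in col])
    # Step 3: Lasso by cyclic coordinate descent, minimising
    # 0.5 * ||y - intercept - X beta||^2 + lam * ||beta||_1.
    beta = [0.0] * len(vocab)
    intercept = sum(y) / len(y)
    resid = [yi - intercept for yi in y]
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            # Columns have unit norm, so the partial fit is a dot product plus beta[j].
            rho = sum(x * r for x, r in zip(col, resid)) + beta[j]
            new = soft_threshold(rho, lam)
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - x * delta for x, r in zip(col, resid)]
                beta[j] = new
        # Unpenalised intercept update: absorb the mean residual.
        shift = sum(resid) / len(resid)
        intercept += shift
        resid = [r - shift for r in resid]
    # Step 4: the summary is the set of words with positive weight, largest first.
    selected = [(w, b) for w, b in zip(vocab, beta) if b > 0.0]
    return sorted(selected, key=lambda wb: -wb[1])


# Toy corpus (invented for this sketch).
docs = [
    "china trade tariffs beijing economy",
    "beijing china exports trade growth",
    "china beijing olympics trade",
    "football match goal score team",
    "weather rain storm forecast team",
    "election vote senate campaign economy",
]
summary = ccs_summary(docs, "china")
for word, weight in summary:
    print(word, round(weight, 3))
```

Raising `lam` shrinks the summary toward fewer words, and lowering it admits more. When two words occur in exactly the same documents, as `beijing` and `trade` do here, the Lasso typically keeps only one of them, which is one reason sparse summaries stay short.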
