From Documents to Segments: A Contextual Reformulation for Topic Assignment
Hoonsang Yoon, Takyoung Kim, Wonkee Lee, Ilmin Cho, Dilek Hakkani-Tür, Stanley Jungkyu Choi

TL;DR
This paper introduces segment-based topic allocation (SBTA), a novel approach that assigns topics to text segments rather than entire documents, improving interpretability and analysis of multi-topic texts.
Contribution
It proposes SBTA, a new framework for fine-grained topic modeling at the segment level, supported by a new dataset and evaluation methods.
Findings
SBTA improves clustering quality across models
Segment-level evaluation enhances topical coherence assessment
The approach yields more interpretable topics in heterogeneous corpora
Abstract
Traditional topic modeling assigns a single topic to each document. In practice, however, many real-world documents, such as product reviews or open-ended survey responses, contain multiple distinct topics. This mismatch often leads to topic contamination, where unrelated themes are merged into a single topic, making it difficult to identify documents that truly focus on a specific subject. We address this issue by introducing segment-based topic allocation (SBTA), a reformulation of topic modeling that assigns topics not to entire documents, but to segments: short, coherent spans of text that each express a single theme. By modeling topical structure at the segment level, our approach yields cleaner and more interpretable topics and better supports analysis of multi-theme documents. To support systematic evaluation, we construct SemEval-STM, a new dataset inspired by aspect-based…
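To make the document-versus-segment distinction concrete, the following is a minimal sketch and not the authors' method. It assumes that sentences can stand in for segments and that a small hand-written keyword table can stand in for a topic model; `TOPIC_KEYWORDS`, `split_segments`, `assign_topic`, and `segment_topics` are hypothetical names introduced only for this illustration.

```python
import re
from collections import Counter

# Hypothetical seed keywords per topic (illustration only, not from the paper).
TOPIC_KEYWORDS = {
    "battery": {"battery", "charge", "charging"},
    "screen": {"screen", "display", "brightness"},
    "price": {"price", "expensive", "cheap", "cost"},
}


def split_segments(document):
    """Naive segmenter: treat each sentence as one segment (an assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def assign_topic(text):
    """Return the topic with the most keyword hits in the text, or None."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for topic, keywords in TOPIC_KEYWORDS.items():
        counts[topic] = sum(1 for t in tokens if t in keywords)
    topic, score = counts.most_common(1)[0]
    return topic if score > 0 else None


def segment_topics(document):
    """Segment-level allocation: one topic label per segment."""
    return [(seg, assign_topic(seg)) for seg in split_segments(document)]


review = (
    "The battery lasts two days. "
    "The screen is too dim outdoors. "
    "It was expensive."
)

# Document-level assignment must collapse three themes into one label.
print("document:", assign_topic(review))

# Segment-level assignment keeps each theme separate.
for segment, topic in segment_topics(review):
    print(f"{topic}: {segment}")
```

In this toy review, a single document-level label has to pick one of three equally supported themes, while the segment-level pass labels each sentence on its own. A real system would replace both the sentence splitter and the keyword lookup with learned components.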
