Text Segmentation Using Exponential Models
Doug Beeferman, Adam Berger, John Lafferty (Carnegie Mellon)

TL;DR
This paper presents a statistical text segmentation method that combines short- and long-range language models with learned lexical hints, and evaluates it with a new probabilistic error metric on Wall Street Journal articles and the TDT Corpus.
Contribution
It introduces a new probabilistic approach for text segmentation that integrates multiple language models and lexical cues, along with a novel evaluation metric.
Findings
Effective segmentation on news articles and transcripts
Proposes a probabilistically motivated error metric intended to supersede precision and recall for appraising segmentation algorithms
Demonstrates robustness across different domains
Abstract
This paper introduces a new statistical approach to partitioning text automatically into coherent segments. Our approach enlists both short-range and long-range language models to help it sniff out likely sites of topic changes in text. To aid its search, the system consults a set of simple lexical hints it has learned to associate with the presence of boundaries through inspection of a large corpus of annotated data. We also propose a new probabilistically motivated error metric for use by the natural language processing and information retrieval communities, intended to supersede precision and recall for appraising segmentation algorithms. Qualitative assessment of our algorithm as well as evaluation using this new metric demonstrate the effectiveness of our approach in two very different domains, Wall Street Journal articles and the TDT Corpus, a collection of newswire articles and…
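The abstract does not spell out the proposed error metric, so the following is only a minimal sketch of one probabilistically motivated formulation. It assumes the metric estimates the probability that two text units a fixed distance k apart are inconsistently judged (same segment vs. different segments) by the reference and the hypothesized segmentation. The function names, the boundary representation, and the default choice of k (half the mean reference segment length) are assumptions for illustration, not details taken from the paper.

```python
def segment_ids(n_units, boundaries):
    """Map each unit index to a segment id.

    A boundary b means a new segment starts at unit b (hypothetical convention).
    """
    starts = set(boundaries)
    ids, seg = [], 0
    for i in range(n_units):
        if i in starts and i > 0:
            seg += 1
        ids.append(seg)
    return ids


def window_error(n_units, ref_boundaries, hyp_boundaries, k=None):
    """Fraction of unit pairs (i, i + k) on which reference and hypothesis
    disagree about whether the two units lie in the same segment."""
    ref = segment_ids(n_units, ref_boundaries)
    hyp = segment_ids(n_units, hyp_boundaries)
    if k is None:
        # Assumed default: half the mean reference segment length.
        n_segments = ref[-1] + 1
        k = max(1, n_units // (2 * n_segments))
    if k >= n_units:
        raise ValueError("window size k must be smaller than n_units")
    disagreements = 0
    for i in range(n_units - k):
        same_in_ref = ref[i] == ref[i + k]
        same_in_hyp = hyp[i] == hyp[i + k]
        if same_in_ref != same_in_hyp:
            disagreements += 1
    return disagreements / (n_units - k)


# 20 sentences with one true topic change before sentence 10.
exact = window_error(20, [10], [10])      # boundary placed correctly
near_miss = window_error(20, [10], [11])  # boundary off by one sentence
missed = window_error(20, [10], [])       # boundary not detected at all
print(exact, near_miss, missed)
```

Under these assumptions, a boundary placed one sentence late is penalized less than a boundary that is missed entirely, whereas exact-match precision and recall would score both hypotheses as complete failures.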
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies
