Text Segmentation Using Exponential Models

Doug Beeferman; Adam Berger; John Lafferty (Carnegie Mellon)

arXiv:cmp-lg/9706016 · cmp-lg · February 3, 2008 · 85 cites

TL;DR

This paper presents a novel statistical text segmentation method combining short- and long-range language models with lexical hints, evaluated using a new error metric across diverse news domains.

Contribution

The paper introduces a probabilistic approach to text segmentation that integrates short- and long-range language models with learned lexical cues, along with a new error metric for evaluating segmentation algorithms.

Findings

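01. Effective segmentation on news articles and transcripts

02. Proposes a probabilistically motivated error metric intended to supersede precision and recall for appraising segmentation algorithms

03. Effective in two very different domains: Wall Street Journal articles and the TDT Corpus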

Abstract

This paper introduces a new statistical approach to partitioning text automatically into coherent segments. Our approach enlists both short-range and long-range language models to help it sniff out likely sites of topic changes in text. To aid its search, the system consults a set of simple lexical hints it has learned to associate with the presence of boundaries through inspection of a large corpus of annotated data. We also propose a new probabilistically motivated error metric for use by the natural language processing and information retrieval communities, intended to supersede precision and recall for appraising segmentation algorithms. Qualitative assessment of our algorithm as well as evaluation using this new metric demonstrate the effectiveness of our approach in two very different domains, Wall Street Journal articles and the TDT Corpus, a collection of newswire articles and…
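The abstract does not reproduce the definition of the proposed error metric, so the following is only a minimal sketch of one way a probabilistically motivated segmentation metric can be set up, not the authors' formulation. It estimates the probability that two text units a fixed distance `k` apart are classified inconsistently (same segment versus different segments) by a reference and a hypothesized segmentation. The function names, the fixed window `k`, and the boundary encoding (a boundary `b` means a new segment starts at unit `b`) are all assumptions made for illustration.

```python
from typing import List, Sequence


def segment_ids(boundaries: Sequence[int], n: int) -> List[int]:
    """Label each of n units with the index of the segment it falls in.

    Assumed encoding: a boundary b means a new segment starts at unit b.
    """
    starts = set(boundaries)
    ids: List[int] = []
    current = 0
    for i in range(n):
        if i > 0 and i in starts:
            current += 1
        ids.append(current)
    return ids


def window_disagreement(
    ref_boundaries: Sequence[int],
    hyp_boundaries: Sequence[int],
    n: int,
    k: int,
) -> float:
    """Fraction of unit pairs (i, i + k) on which reference and hypothesis
    disagree about whether the two units belong to the same segment.

    Hypothetical fixed-window simplification; 0.0 means the two
    segmentations agree on every sampled pair.
    """
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < n")
    ref = segment_ids(ref_boundaries, n)
    hyp = segment_ids(hyp_boundaries, n)
    errors = 0
    for i in range(n - k):
        same_in_ref = ref[i] == ref[i + k]
        same_in_hyp = hyp[i] == hyp[i + k]
        if same_in_ref != same_in_hyp:
            errors += 1
    return errors / (n - k)


if __name__ == "__main__":
    # Ten units with one true boundary before unit 5.
    print(window_disagreement([5], [5], n=10, k=2))  # exact match
    print(window_disagreement([5], [6], n=10, k=2))  # near miss
    print(window_disagreement([5], [9], n=10, k=2))  # far miss
```

Unlike precision and recall computed on exact boundary positions, a pairwise measure of this kind penalizes a hypothesized boundary in proportion to how far it lands from the true one, up to the window size, so a near miss scores better than a distant one.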

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies