Multi-Paragraph Segmentation of Expository Text
Marti A. Hearst (UC Berkeley, Xerox PARC)

TL;DR
This paper introduces TextTiling, an algorithm that segments expository texts into coherent multi-paragraph units based on subtopic structure, using lexical frequency and distribution analysis.
Contribution
The paper presents a domain-independent algorithm for text segmentation that identifies subtopic boundaries in lengthy expository texts using only lexical frequency and distribution information.
Findings
Segmentation corresponds well to human judgments of the major subtopic boundaries
Two fully implemented versions of the algorithm are evaluated on thirteen lengthy texts
Applicable to lengthy expository texts
Abstract
This paper describes TextTiling, an algorithm for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts. The algorithm uses domain-independent lexical frequency and distribution information to recognize the interactions of multiple simultaneous themes. Two fully-implemented versions of the algorithm are described and shown to produce segmentation that corresponds well to human judgments of the major subtopic boundaries of thirteen lengthy texts.
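The abstract does not spell out the procedure, so the following is only a minimal sketch of the general idea of segmenting by lexical distribution, not the paper's implementation. It assumes the text is cut into fixed-size token sequences, that adjacent blocks of sequences are compared by cosine similarity of their term counts, and that a boundary is placed at a gap whose similarity "valley" is deep relative to its neighbours. The function names, the parameters `w`, `k`, and `min_depth`, and the mean-based cutoff are all illustrative assumptions.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gap_similarities(tokens, w=20, k=2):
    """Similarity at every gap between consecutive w-token sequences.

    Each gap compares the k sequences before it with the k after it.
    """
    seqs = [Counter(tokens[i:i + w]) for i in range(0, len(tokens), w)]
    sims = []
    for gap in range(1, len(seqs)):
        left = sum(seqs[max(0, gap - k):gap], Counter())
        right = sum(seqs[gap:gap + k], Counter())
        sims.append(cosine(left, right))
    return sims


def depth_scores(sims):
    """How far each gap's similarity sits below the nearest peaks on both sides."""
    depths = []
    for i, s in enumerate(sims):
        left = i
        while left > 0 and sims[left - 1] >= sims[left]:
            left -= 1
        right = i
        while right < len(sims) - 1 and sims[right + 1] >= sims[right]:
            right += 1
        depths.append((sims[left] - s) + (sims[right] - s))
    return depths


def boundaries(tokens, w=20, k=2, min_depth=0.1):
    """Token offsets of hypothesised subtopic boundaries.

    The cutoff (mean depth, floored at min_depth) is an assumption made
    for this sketch, not a rule taken from the paper.
    """
    sims = gap_similarities(tokens, w, k)
    if not sims:
        return []
    depths = depth_scores(sims)
    cutoff = max(sum(depths) / len(depths), min_depth)
    return [(i + 1) * w for i, d in enumerate(depths) if d > cutoff]
```

As a usage example, a token list whose first half draws on one vocabulary and whose second half draws on another should yield a single boundary at the switch point; real text would first need tokenisation and, most likely, stopword removal, which this sketch leaves out.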
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
