Text Segmentation based on Semantic Word Embeddings
Alexander A Alemi, Paul Ginsparg

TL;DR
This paper introduces text segmentation methods built on semantic word embeddings, reports state-of-the-art performance for an untrained method on the Choi test set, and applies the approach to text extracted from scholarly articles on arXiv.org.
Contribution
It presents a general framework for a class of segmentation objectives, compares greedy and exact optimization, proposes an iterative refinement technique for greedy strategies, and introduces Content Vector Segmentation (CVS).
Findings
State-of-the-art performance for an untrained method with CVS on the Choi test set
Iterative refinement improves the performance of greedy strategies
Segmentation procedure applied to text extracted from scholarly articles in the arXiv.org database
Abstract
We explore the use of semantic word embeddings in text segmentation algorithms, including the C99 segmentation algorithm and new algorithms inspired by the distributed word vector representation. By developing a general framework for discussing a class of segmentation objectives, we study the effectiveness of greedy versus exact optimization approaches and suggest a new iterative refinement technique for improving the performance of greedy strategies. We compare our results to known benchmarks, using known metrics. We demonstrate state-of-the-art performance for an untrained method with our Content Vector Segmentation (CVS) on the Choi test set. Finally, we apply the segmentation procedure to an in-the-wild dataset consisting of text extracted from scholarly articles in the arXiv.org database.
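The contrast the abstract draws between greedy optimization, exact optimization, and iterative refinement can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not define the objective, so `segment_score` uses a hypothetical score (the L1 norm of the summed word vectors in a segment), and the function names `greedy_split`, `exact_split`, and `refine` are invented for this example and are not taken from the paper.

```python
from typing import List, Sequence

Vector = Sequence[float]


def segment_score(vectors: Sequence[Vector]) -> float:
    """Hypothetical segment score (an assumption, not the paper's definition):
    the L1 norm of the sum of the word vectors in the segment."""
    if not vectors:
        return 0.0
    dims = len(vectors[0])
    return sum(abs(sum(v[d] for v in vectors)) for d in range(dims))


def total_score(vectors: Sequence[Vector], boundaries: List[int]) -> float:
    """Objective for a whole segmentation: the sum of per-segment scores."""
    edges = [0] + sorted(boundaries) + [len(vectors)]
    return sum(segment_score(vectors[a:b]) for a, b in zip(edges, edges[1:]))


def greedy_split(vectors: Sequence[Vector], k: int) -> List[int]:
    """Greedy: add k-1 boundaries one at a time, each chosen to maximize
    the objective given the boundaries already placed."""
    boundaries: List[int] = []
    for _ in range(k - 1):
        candidates = [p for p in range(1, len(vectors)) if p not in boundaries]
        best = max(candidates, key=lambda p: total_score(vectors, boundaries + [p]))
        boundaries.append(best)
    return sorted(boundaries)


def exact_split(vectors: Sequence[Vector], k: int) -> List[int]:
    """Exact: dynamic programming over (prefix length, number of segments)."""
    n = len(vectors)
    neg_inf = float("-inf")
    best = [[neg_inf] * (k + 1) for _ in range(n + 1)]
    back = [[0] * (k + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for s in range(j - 1, i):
                if best[s][j - 1] == neg_inf:
                    continue
                cand = best[s][j - 1] + segment_score(vectors[s:i])
                if cand > best[i][j]:
                    best[i][j] = cand
                    back[i][j] = s
    # Walk back from the full text to recover the segment start positions.
    boundaries: List[int] = []
    i = n
    for j in range(k, 0, -1):
        i = back[i][j]
        if j > 1:
            boundaries.append(i)
    return sorted(boundaries)


def refine(vectors: Sequence[Vector], boundaries: List[int], max_passes: int = 10) -> List[int]:
    """Iterative refinement: move one boundary at a time, keeping the others
    fixed, and accept a move only if it strictly improves the objective."""
    current = sorted(boundaries)
    for _ in range(max_passes):
        changed = False
        for idx in range(len(current)):
            lo = current[idx - 1] + 1 if idx > 0 else 1
            hi = current[idx + 1] - 1 if idx + 1 < len(current) else len(vectors) - 1

            def with_boundary(p: int) -> List[int]:
                return current[:idx] + [p] + current[idx + 1:]

            best = max(range(lo, hi + 1), key=lambda p: total_score(vectors, with_boundary(p)))
            if total_score(vectors, with_boundary(best)) > total_score(vectors, current):
                current = with_boundary(best)
                changed = True
        if not changed:
            break
    return current


# Toy "document": three runs of identical 2-d word vectors standing in for topics.
toy = [(1, -1)] * 3 + [(-1, 1)] * 3 + [(1, 1)] * 3
```

On the toy input, `greedy_split(toy, 3)`, `exact_split(toy, 3)`, and `refine(toy, [2, 7])` can be compared directly. The greedy search costs far less than the dynamic program on long texts but carries no optimality guarantee, which is the gap a refinement pass is meant to narrow.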
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies
