An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery
Michael R. Brent

TL;DR
This paper introduces a probabilistically sound, language-independent algorithm for segmenting text into words without supervision, based on a flexible model that considers the entire corpus as a single probabilistic event.
Contribution
It presents a modular probabilistic model for unsupervised word boundary detection that is reported to outperform previously proposed algorithms when segmenting transcripts of spontaneous speech.
Findings
More effective than previously proposed algorithms when utterance boundaries are given
Performs well on short utterances in spontaneous speech
Uses a language-independent, corpus-wide probability model
Abstract
This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text. The fundamental structure of the model is specified abstractly so that the detailed component models of phonology, word-order, and word frequency can be replaced in a modular fashion. The model yields a language-independent, prior probability distribution on all possible sequences of all possible words over a given alphabet, based on the assumption that the input was generated by concatenating words from a fixed but unknown lexicon. The model is unusual in that it treats the generation of a complete corpus, regardless of length, as a single event in the probability space. Accordingly, the algorithm does not estimate a probability distribution on…
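To make the task concrete, the following is a minimal sketch of recovering word boundaries by dynamic programming, under the abstract's assumption that the text was produced by concatenating words from a lexicon. It is not the paper's algorithm: the lexicon counts are supplied up front rather than discovered, each word is scored independently rather than treating the whole corpus as a single event, and the novel-word penalty is an illustrative choice. The names `make_scorer`, `segment`, `alphabet_size`, and `max_len` are hypothetical.

```python
import math


def make_scorer(counts, alphabet_size=26):
    """Build a log-probability function for candidate words.

    Assumption (not from the paper): known words are scored by relative
    frequency, and any novel word shares one reserved count, further
    penalised by a uniform spelling model with an end-of-word symbol.
    """
    total = sum(counts.values())

    def word_logprob(word):
        if word in counts:
            return math.log(counts[word] / (total + 1))
        spelling = (len(word) + 1) * math.log(1 / (alphabet_size + 1))
        return math.log(1 / (total + 1)) + spelling

    return word_logprob


def segment(text, word_logprob, max_len=10):
    """Return the highest-scoring split of `text` into words."""
    n = len(text)
    # best[i] = (score of best segmentation of text[:i], start of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start][0] + word_logprob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


scorer = make_scorer({"the": 5, "dog": 3, "a": 2})
print(segment("thedog", scorer))
```

With these toy counts, the known words "the" and "dog" score far better than any split that introduces a novel word, so the call recovers the boundary between them. An unsupervised method in the spirit of the paper would also have to decide when positing a new word is worthwhile, which this sketch handles only through the fixed penalty.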
Taxonomy
Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Speech Recognition and Synthesis
