An Efficient, Probabilistically Sound Algorithm for Segmentation and Word Discovery

Michael R. Brent

arXiv:cs/9905007·cs.CL·May 23, 2007·208 cites

TL;DR

This paper introduces a probabilistically sound, language-independent algorithm for segmenting text into words without supervision, based on a flexible model that considers the entire corpus as a single probabilistic event.

Contribution

It presents a novel, modular, probabilistic model for unsupervised word boundary detection that outperforms previous algorithms in specific speech segmentation tasks.

Findings

01 More effective than other proposed algorithms when utterance boundaries are given

02 Performs well on short utterances in spontaneous speech

03 Uses a language-independent, corpus-wide probability model

Abstract

This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text. The fundamental structure of the model is specified abstractly so that the detailed component models of phonology, word-order, and word frequency can be replaced in a modular fashion. The model yields a language-independent, prior probability distribution on all possible sequences of all possible words over a given alphabet, based on the assumption that the input was generated by concatenating words from a fixed but unknown lexicon. The model is unusual in that it treats the generation of a complete corpus, regardless of length, as a single event in the probability space. Accordingly, the algorithm does not estimate a probability distribution on…
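To make the task concrete, here is a minimal sketch of recovering word boundaries from text in which they have been deleted. It is not the paper's algorithm: the paper's model discovers the lexicon without supervision and treats the whole corpus as a single event, whereas this sketch assumes the lexicon and its word probabilities are already known and uses a standard Viterbi (dynamic-programming) search under a unigram word model. The names `segment`, `word_probs`, `unknown_logp_per_char`, and the toy lexicon are all hypothetical, chosen only for illustration.

```python
import math


def segment(text, word_probs, unknown_logp_per_char=-10.0):
    """Return the most probable split of `text` into words.

    Assumption (not from the paper): `word_probs` maps each known word to a
    fixed unigram probability. Characters that no known word can cover are
    emitted as one-character words at a flat log-probability penalty.
    """
    n = len(text)
    max_len = max(len(w) for w in word_probs)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in word_probs:
                score = best[start] + math.log(word_probs[piece])
            elif len(piece) == 1:
                score = best[start] + unknown_logp_per_char
            else:
                continue
            if score > best[end]:
                best[end] = score
                back[end] = start

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Toy lexicon with made-up probabilities, for illustration only.
toy_lexicon = {"look": 0.2, "at": 0.2, "the": 0.3,
               "doggie": 0.1, "dog": 0.1, "a": 0.1}
print(segment("lookatthedoggie", toy_lexicon))
```

The search scores every candidate boundary placement by the product of its word probabilities and keeps the best one. An unsupervised approach such as the one described in the abstract must additionally infer which words exist in the first place, which this sketch sidesteps by taking the lexicon as given.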

Code & Models

Repositories

kamperh/dpdp_aernn
pytorch

Taxonomy

Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Speech Recognition and Synthesis