Unsupervised Discovery of Morphemes

Mathias Creutz; Krista Lagus

arXiv:cs/0205057 · cs.CL · May 23, 2007 · 23 cites

TL;DR

This paper introduces two unsupervised methods for segmenting words into morpheme-like units, using a model especially suited to morphologically rich languages such as Finnish, and shows that they perform well against a current state-of-the-art system on Finnish and English corpora.

Contribution

The paper proposes two unsupervised segmentation methods for languages with complex morphology: one based on the Minimum Description Length (MDL) principle, which works online, and one based on Maximum Likelihood (ML) optimization.

Findings

01 The methods perform well on Finnish and English corpora

02 Competitive with a current state-of-the-art system

03 Especially suited to languages with rich morphology

Abstract

We present two methods for unsupervised segmentation of words into morpheme-like units. The model utilized is especially suited for languages with a rich morphology, such as Finnish. The first method is based on the Minimum Description Length (MDL) principle and works online. In the second method, Maximum Likelihood (ML) optimization is used. The quality of the segmentations is measured using an evaluation method that compares the segmentations produced to an existing morphological analysis. Experiments on both Finnish and English corpora show that the presented methods perform well compared to a current state-of-the-art system.
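
To make the MDL idea concrete, the following is a minimal sketch of a two-part description length: the cost of spelling out a lexicon of distinct morphs plus the cost of encoding the corpus as a sequence of references to that lexicon. It is an illustration under stated assumptions, not the authors' implementation. The function name `description_length`, the constant `BITS_PER_CHAR`, the flat per-character lexicon cost, and the toy Finnish word list are all hypothetical choices made for this example.

```python
import math
from collections import Counter

# Assumption: every character in the lexicon costs the same fixed number of bits.
BITS_PER_CHAR = 5.0


def description_length(segmented_corpus):
    """Return the two-part cost in bits of a segmented corpus.

    segmented_corpus is a list of words, each word a list of morph strings.
    """
    # Count how often each morph occurs as a token in the corpus.
    counts = Counter(m for word in segmented_corpus for m in word)
    total = sum(counts.values())

    # Corpus cost: negative log-likelihood of the morph tokens, using
    # maximum-likelihood morph probabilities count / total.
    corpus_cost = -sum(c * math.log2(c / total) for c in counts.values())

    # Lexicon cost: spell out each distinct morph once, character by character.
    lexicon_cost = sum(BITS_PER_CHAR * len(m) for m in counts)

    return corpus_cost + lexicon_cost


# Toy data: four Finnish word forms, unsegmented and with a candidate segmentation.
unsegmented = [["talossa"], ["talossani"], ["autossa"], ["autossani"]]
segmented = [
    ["talo", "ssa"],
    ["talo", "ssa", "ni"],
    ["auto", "ssa"],
    ["auto", "ssa", "ni"],
]

print(description_length(unsegmented))
print(description_length(segmented))
```

On this toy data the segmented version gets the lower total, because the shared morphs `ssa` and `ni` are stored in the lexicon only once. A search procedure that prefers lower totals would therefore favor splitting these words, which is the intuition behind using description length as a segmentation criterion.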

Taxonomy

Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Web Data Mining and Analysis