Parsimonious Morpheme Segmentation with an Application to Enriching Word Embeddings
Ahmed El-Kishky, Frank Xu, Aston Zhang, Jiawei Han

TL;DR
This paper introduces MorphMine, an unsupervised method for morpheme segmentation. Enriching word embeddings with the resulting subword structures yields better semantic representations, especially for rare and out-of-vocabulary words.
Contribution
MorphMine is a novel, parsimonious hierarchical morpheme segmentation method that enhances word embedding quality across multiple languages and tasks.
Findings
MorphMine segments words into human-verified morphemes across languages.
Enriching embeddings with MorphMine morphemes improves performance on evaluation tasks.
The method effectively handles infrequent and out-of-vocabulary words.
Abstract
Traditionally, many text-mining tasks treat individual word-tokens as the finest meaningful semantic granularity. However, in many languages and specialized corpora, words are composed by concatenating semantically meaningful subword structures. Word-level analysis cannot leverage the semantic information present in such subword structures. With regard to word embedding techniques, this leads to not only poor embeddings for infrequent words in long-tailed text corpora but also weak capabilities for handling out-of-vocabulary words. In this paper we propose MorphMine for unsupervised morpheme segmentation. MorphMine applies a parsimony criterion to hierarchically segment words into the fewest number of morphemes at each level of the hierarchy. This leads to longer shared morphemes at each level of segmentation. Experiments show that MorphMine segments words in a variety of languages into…
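The parsimony idea in the abstract (fewest morphemes at each level, applied hierarchically) can be illustrated with a small dynamic program. This is a minimal sketch and not the authors' MorphMine algorithm: it assumes a morpheme vocabulary is already given, whereas the paper's method is unsupervised, and the names `fewest_pieces`, `hierarchical_segment`, and the toy vocabulary are hypothetical.

```python
from typing import List, Optional, Set, Union

Tree = Union[str, List["Tree"]]


def fewest_pieces(word: str, vocab: Set[str]) -> Optional[List[str]]:
    """Split `word` into the fewest pieces that all appear in `vocab`.

    Returns None when no segmentation exists. Ties are broken in favour
    of the first candidate found (an arbitrary choice for this sketch).
    """
    n = len(word)
    # best[i] holds the shortest known segmentation of word[:i], or None.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            piece = word[start:end]
            if prefix is None or piece not in vocab:
                continue
            candidate = prefix + [piece]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[n]


def hierarchical_segment(word: str, vocab: Set[str]) -> Tree:
    """Recursively apply the fewest-pieces split to build a nested tree.

    The word itself is excluded from the vocabulary at each level so that
    a level either splits the word into two or more pieces or stops.
    """
    pieces = fewest_pieces(word, vocab - {word})
    if pieces is None or len(pieces) < 2:
        return word  # no further split available: treat as a leaf
    return [hierarchical_segment(p, vocab) for p in pieces]


# Toy vocabulary, assumed for illustration only.
toy_vocab = {"un", "believ", "able", "unbeliev"}
top_level = fewest_pieces("unbelievable", toy_vocab)
tree = hierarchical_segment("unbelievable", toy_vocab)
```

With this toy vocabulary the top level prefers the two long pieces `unbeliev` + `able` over the three shorter ones, and the next level splits `unbeliev` into `un` + `believ`, which mirrors the abstract's point that parsimony favours longer shared morphemes at each level.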
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
