Unsupervised discovery of morphologically related words based on orthographic and semantic similarity
Marco Baroni, Johannes Matiasek, Harald Trost

TL;DR
This paper introduces an unsupervised algorithm that identifies morphologically related word pairs by combining orthographic similarity (minimum edit distance) with semantic similarity (mutual information). It does not rely on a morpheme concatenation model or on substring statistics such as affix frequency, and it is evaluated on English and German corpora.
Contribution
The proposed method combines orthographic and semantic similarity to discover morphological relations, without assuming a morpheme concatenation model or using distributional properties of word substrings.
Findings
Encouraging precision, measured as the proportion of good pairs at various cutoff points of the ranked list
Results reported for both English and German input
Qualitative analysis of the types of morphological patterns the algorithm discovers
Abstract
We present an algorithm that takes an unannotated corpus as its input, and returns a ranked list of probable morphologically related pairs as its output. The algorithm tries to discover morphologically related pairs by looking for pairs that are both orthographically and semantically similar, where orthographic similarity is measured in terms of minimum edit distance, and semantic similarity is measured in terms of mutual information. The procedure does not rely on a morpheme concatenation model, nor on distributional properties of word substrings (such as affix frequency). Experiments with German and English input give encouraging results, both in terms of precision (proportion of good pairs found at various cutoff points of the ranked list), and in terms of a qualitative analysis of the types of morphological patterns discovered by the algorithm.
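The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so several choices here are assumptions made only for illustration: unit-cost Levenshtein distance normalized by word length, pointwise mutual information computed over sentence-level co-occurrence, a product as the way to combine the two scores, and the hypothetical names `orthographic_similarity`, `pointwise_mi`, `rank_pairs`, `precision_at`, and `min_ortho`. It should not be read as the authors' implementation.

```python
import math
from collections import Counter
from itertools import combinations


def edit_distance(a, b):
    """Minimum edit distance with unit costs (cost scheme is an assumption)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def orthographic_similarity(a, b):
    """Map edit distance to [0, 1]; this normalization is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def pointwise_mi(sentences):
    """Pointwise MI of word pairs co-occurring in the same sentence.

    Using the sentence as the co-occurrence window is an assumption.
    Pair keys are (w1, w2) with w1 < w2 in alphabetical order.
    """
    n = len(sentences)
    word_counts = Counter()
    pair_counts = Counter()
    for sent in sentences:
        types = sorted(set(sent))
        word_counts.update(types)
        pair_counts.update(combinations(types, 2))
    return {
        (w1, w2): math.log2(c * n / (word_counts[w1] * word_counts[w2]))
        for (w1, w2), c in pair_counts.items()
    }


def rank_pairs(sentences, min_ortho=0.5):
    """Rank pairs that are both orthographically and semantically similar.

    The product used as the combined score is an illustrative choice.
    """
    ranked = []
    for (w1, w2), mi in pointwise_mi(sentences).items():
        ortho = orthographic_similarity(w1, w2)
        if ortho >= min_ortho and mi > 0:
            ranked.append((w1, w2, ortho * mi))
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked


def precision_at(ranked, gold_pairs, k):
    """Proportion of the top-k ranked pairs found in a gold set of frozensets."""
    top = ranked[:k]
    if not top:
        return 0.0
    hits = sum(1 for w1, w2, _ in top if frozenset((w1, w2)) in gold_pairs)
    return hits / len(top)
```

On a real corpus, `rank_pairs` would receive tokenized sentences and `precision_at` would be evaluated at several values of `k` against manually judged pairs, mirroring the cutoff-based evaluation the abstract mentions. The nested pair enumeration is quadratic in sentence length, so a practical version would need to restrict the candidate pairs first.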
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
