SS4MCT: A Statistical Stemmer for Morphologically Complex Texts
Javid Dadashkarimi, Hossein Nasr Esfahani, Heshaam Faili, and Azadeh Shakery

TL;DR
This paper introduces SS4MCT, a statistical stemmer designed for morphologically complex texts that effectively identifies affixes, including infixes, to improve stemming accuracy in highly inflected languages.
Contribution
The paper presents a novel statistical method for finding affixes and inflectional rules, including infixes, based on minimum edit distance and rule likelihoods, enhancing stemming in complex texts.
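As a rough illustration of how a minimum edit distance table can expose affixes in any position of a word, the sketch below builds a standard Levenshtein table for a word pair and backtraces it into a list of edit operations. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `edit_distance_table` and `edit_operations`, the unit costs, and the English word pairs are all hypothetical choices made for illustration.

```python
def edit_distance_table(src, tgt):
    """Build the standard Levenshtein table with unit costs (an assumption)."""
    m, n = len(src), len(tgt)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        table[i][0] = i  # delete every character of src[:i]
    for j in range(n + 1):
        table[0][j] = j  # insert every character of tgt[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # deletion
                table[i][j - 1] + 1,         # insertion
                table[i - 1][j - 1] + cost,  # match or substitution
            )
    return table


def edit_operations(src, tgt):
    """Backtrace the table into (op, position in src, old, new) tuples."""
    table = edit_distance_table(src, tgt)
    i, j = len(src), len(tgt)
    ops = []
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and src[i - 1] == tgt[j - 1]
                and table[i][j] == table[i - 1][j - 1]):
            i, j = i - 1, j - 1  # characters match, nothing to record
        elif i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + 1:
            ops.append(("sub", i - 1, src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + 1:
            ops.append(("del", i - 1, src[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", i, "", tgt[j - 1]))
            j -= 1
    ops.reverse()
    return ops


# A word-internal change shows up as an operation in the middle of the word,
# which prefix/suffix matching alone would not isolate.
print(edit_operations("sing", "sang"))
print(edit_operations("walk", "walked"))
```

Because each operation carries its position in the source word, changes at the start, middle, or end of a word are all represented the same way, which is the property that makes an edit-distance view attractive for infix detection.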
Findings
Significantly outperforms baselines in MAP on CLEF tasks
Effectively identifies infixes in irregular inflections
Improves stemming accuracy in morphologically complex languages
Abstract
There have been multiple attempts to resolve various inflection matching problems in information retrieval, and stemming is a common approach to this end. Among the many stemming techniques, statistical stemming has been shown to be effective in a number of languages, particularly highly inflected ones. In this paper we propose a method for finding affixes in different positions of a word. Common statistical techniques rely heavily on string similarity in terms of prefix and suffix matching. Since infixes are common in irregular/informal inflections in morphologically complex texts, stemming must also be able to find infixes. The proposed method aims to find statistical inflectional rules based on the minimum edit distance table of word pairs and the likelihoods of the rules in a language. These rules are used to statistically stem words and can be used in different…
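The idea of attaching a likelihood to each inflectional rule can be sketched as a relative-frequency estimate over a set of word pairs. The example below is only an assumption-laden illustration: the abstract does not specify the estimator, so the relative-frequency formula, the function names `extract_rules` and `rule_likelihoods`, and the toy word pairs are hypothetical. For brevity it also swaps in Python's `difflib.SequenceMatcher` to align each pair, which is based on longest matching blocks rather than a minimum edit distance table.

```python
from collections import Counter
from difflib import SequenceMatcher


def extract_rules(src, tgt):
    """Return (old, new) string rewrites that turn src into tgt.

    Uses difflib's alignment as a stand-in for an edit-distance backtrace.
    """
    rules = []
    matcher = SequenceMatcher(None, src, tgt)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            rules.append((src[i1:i2], tgt[j1:j2]))
    return rules


def rule_likelihoods(word_pairs):
    """Estimate each rule's likelihood as its share of all observed rules."""
    counts = Counter()
    for src, tgt in word_pairs:
        counts.update(extract_rules(src, tgt))
    total = sum(counts.values())
    return {rule: count / total for rule, count in counts.items()}


# Hypothetical toy data: two pairs share an internal vowel change,
# one pair adds a suffix.
pairs = [("sing", "sang"), ("ring", "rang"), ("walk", "walked")]
for rule, likelihood in rule_likelihoods(pairs).items():
    print(rule, round(likelihood, 3))
```

In a sketch like this, rules that recur across many pairs receive higher estimates, so a stemmer could prefer frequent rewrites over coincidental string differences.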
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Algorithms and Data Compression
