N-gram Statistical Stemmer for Bangla Corpus
Rabeya Sadia, Md Ataur Rahman, Md Hanif Seddiqui

TL;DR
This paper introduces an N-gram-based stemming algorithm for Bangla that clusters related words using the Dice coefficient, achieving approximately 87% accuracy in identifying stems.
Contribution
The study proposes an N-gram stemming approach for Bangla that builds on earlier suffix-removal methods by clustering related words instead of stripping suffixes with hand-written rules.
Findings
Clustered related words with around 87% accuracy.
Demonstrated the effectiveness of N-gram stemming for Bangla.
Compared Affinity Propagation clustering under coefficient similarity and median similarity, with promising results.
Abstract
Stemming trims inflected words to their stem or root form. It improves retrieval effectiveness, especially in text search, by resolving mismatches between different forms of the same word. Previous research on Bangla stemming mostly relied on a recursive rule-based procedure that strips multiple suffixes from a single word to recover its root. Our proposed system extends that work by implementing a stemming algorithm called N-gram stemming. Using an association measure called the Dice coefficient, related words are clustered according to their character structure. The shortest word in a cluster may be considered the stem. We additionally analyzed the Affinity Propagation clustering algorithm with coefficient similarity as well as with median similarity. Our result indicates…
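The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes character bigrams, the standard Dice coefficient 2|A ∩ B| / (|A| + |B|) over the two words' n-gram sets, and a similarity threshold of 0.6 chosen only for illustration. The grouping step is a simple greedy pass standing in for the paper's clustering, not Affinity Propagation. The function names (char_ngrams, dice, cluster_words, stem_of) are hypothetical, and English words are used only so the arithmetic is easy to check; the same code accepts Bangla strings.

```python
from typing import List, Set


def char_ngrams(word: str, n: int = 2) -> Set[str]:
    """Return the set of unique character n-grams in `word`."""
    if len(word) < n:
        # Words shorter than n are treated as a single n-gram.
        return {word}
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient between the n-gram sets of two words."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(x & y) / (len(x) + len(y))


def cluster_words(words: List[str], threshold: float = 0.6,
                  n: int = 2) -> List[List[str]]:
    """Greedy grouping (an illustrative assumption): a word joins the first
    cluster whose every member is at least `threshold` similar to it,
    otherwise it starts a new cluster."""
    clusters: List[List[str]] = []
    for w in words:
        for c in clusters:
            if all(dice(w, m, n) >= threshold for m in c):
                c.append(w)
                break
        else:
            clusters.append([w])
    return clusters


def stem_of(cluster: List[str]) -> str:
    """Take the shortest word in a cluster as its stem."""
    return min(cluster, key=len)


if __name__ == "__main__":
    groups = cluster_words(["play", "plays", "played", "run", "runs"])
    for g in groups:
        print(stem_of(g), g)
```

With these assumptions, "play", "plays", and "played" share the bigrams pl, la, and ay, so they fall into one group whose shortest member is "play". The threshold matters: set too low, unrelated words that happen to share bigrams are merged, and set too high, longer inflected forms split away from their stem.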
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies
