N-gram Statistical Stemmer for Bangla Corpus

Rabeya Sadia; Md Ataur Rahman; Md Hanif Seddiqui

arXiv:1912.11612·cs.CL·December 30, 2019

TL;DR

This paper introduces an N-gram-based stemming algorithm for Bangla that clusters related words using the Dice coefficient, achieving approximately 87% accuracy in identifying stems.

Contribution

The study proposes an N-gram stemming approach for Bangla. Where earlier work recursively stripped suffixes with hand-written rules, this approach clusters words by the similarity of their character structure and takes the shortest word in each cluster as the stem.

Findings

01 Clustered related words with around 87% accuracy.

02 Demonstrated the effectiveness of N-gram stemming for Bangla.

03 Compared Affinity Propagation clustering under coefficient similarity and median similarity, with promising results.

Abstract

Stemming is the process of trimming inflected words to their stem or root form. It improves retrieval effectiveness, especially in text search, where it resolves mismatches between inflected forms of the same word. Previous research on Bangla stemming mostly relied on removing multiple suffixes from a single word through a recursive rule-based procedure that recovers the applicable root step by step. Our proposed system extends this line of work by implementing a statistical stemming algorithm called N-gram stemming. Using an association measure called the Dice coefficient, related sets of words are clustered according to their character structure. The shortest word in a cluster may be considered the stem. We additionally analyzed Affinity Propagation clustering algorithms with coefficient similarity as well as with median similarity. Our result indicates…
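The clustering idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the choice of character bigrams, the 0.5 threshold, the greedy single-link grouping, and the function names (`char_ngrams`, `dice`, `cluster_words`, `stems`) are all assumptions made for illustration. English words are used only so the result is easy to check by hand; the same functions operate on Bangla strings by Unicode code point.

```python
from typing import Dict, List, Set


def char_ngrams(word: str, n: int = 2) -> Set[str]:
    """Return the set of distinct character n-grams of a word."""
    if len(word) < n:
        return {word}  # very short words act as a single unit
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient 2*|A & B| / (|A| + |B|) over n-gram sets."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(x & y) / (len(x) + len(y))


def cluster_words(words: List[str], threshold: float = 0.5,
                  n: int = 2) -> List[List[str]]:
    """Greedy single-link grouping (an assumed, simplified strategy):
    a word joins the first cluster containing any member whose Dice
    similarity to it reaches the threshold, else starts a new cluster."""
    clusters: List[List[str]] = []
    for word in words:
        for cluster in clusters:
            if any(dice(word, member, n) >= threshold for member in cluster):
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


def stems(clusters: List[List[str]]) -> Dict[str, str]:
    """Map every word to the shortest word of its cluster."""
    return {w: min(c, key=len) for c in clusters for w in c}


words = ["connect", "connected", "connecting", "connection",
         "stem", "stemmer", "stemming"]
clusters = cluster_words(words)
print(clusters)
# [['connect', 'connected', 'connecting', 'connection'],
#  ['stem', 'stemmer', 'stemming']]
print(stems(clusters)["connecting"])  # connect
```

The threshold controls how aggressively words are merged: a lower value groups more inflected forms together but risks conflating unrelated words that happen to share a few n-grams.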

Code & Models

Repositories

shaoncsecu/Bangla_n-gram_Stemmer
Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies