A Task-Oriented Evaluation Framework for Text Normalization in Modern NLP Pipelines
Md Abdullah Al Kafi, Raka Moni, Sumit Kumar Banshal

TL;DR
This paper introduces a comprehensive, task-oriented framework for evaluating text stemming methods in NLP, considering utility, downstream impact, and semantic preservation, and demonstrates it through a comparison of Bangla and English stemmers.
Contribution
It proposes a novel evaluation framework combining effectiveness, downstream task impact, and semantic similarity, addressing limitations of existing methods.
Findings
The Bangla stemmer has high effectiveness but risks over-stemming.
The English stemmer balances effectiveness and semantic safety.
The framework helps identify reliable stemming methods for NLP tasks.
Abstract
Text normalization is an essential preprocessing step in many natural language processing (NLP) tasks, and stemming is one such normalization technique that reduces words to their base or root form. However, evaluating stemming methods is challenging because current evaluation approaches are limited and do not capture the potential harm caused by excessive stemming; therefore, it is essential to develop new approaches to evaluate stemming methods. To address this issue, this study proposes a novel, task-oriented approach to evaluate stemming methods, which considers three aspects: (1) the utility of stemming using the Stemming Effectiveness Score (SES), (2) the impact of stemming on downstream tasks using the Model Performance Delta (MPD), and (3) the semantic similarity between stemmed and original words using the Average Normalized Levenshtein Distance (ANLD), thus providing a comprehensive…
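To make the third aspect concrete, the following is a minimal sketch of how an average normalized Levenshtein distance over (original, stemmed) word pairs could be computed. The abstract does not give the exact formula, so the normalization used here (edit distance divided by the length of the longer string) and the function names `levenshtein` and `anld` are assumptions for illustration, not the authors' definition.

```python
from typing import Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def anld(pairs: Iterable[Tuple[str, str]]) -> float:
    """Mean of per-pair edit distances, each scaled into [0, 1].

    Assumption: each distance is divided by the longer string's length.
    Pairs where both strings are empty are skipped.
    """
    normalized = []
    for original, stemmed in pairs:
        longest = max(len(original), len(stemmed))
        if longest == 0:
            continue
        normalized.append(levenshtein(original, stemmed) / longest)
    return sum(normalized) / len(normalized) if normalized else 0.0
```

Under this assumed normalization, a value near 0 means the stems stay close to the original words, while a value near 1 means most characters were changed or removed, which is one plausible signal of over-stemming.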
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Mental Health via Writing
