Quantitative Stopword Generation for Sentiment Analysis via Recursive and Iterative Deletion
Daniel M. DiPietro

TL;DR
This paper introduces a novel quantitative method using recursive and iterative deletion algorithms to generate effective stopword lists for sentiment analysis, significantly reducing dataset size with minimal impact on model performance.
Contribution
It presents a new quantitative approach for stopword generation that outperforms previous qualitative and statistical methods in sentiment analysis tasks.
Findings
Stopword lists reduced dataset size by up to 63.7%.
Model accuracy was maintained or improved despite dataset reduction.
The method is effective for creating task-specific stopword sets.
Abstract
Stopwords carry little semantic information and are often removed from text data to reduce dataset size and improve machine learning model performance. Consequently, researchers have sought to develop techniques for generating effective stopword sets. Previous approaches have ranged from qualitative techniques relying upon linguistic experts, to statistical approaches that extract word importance using correlations or frequency-dependent metrics computed on a corpus. We present a novel quantitative approach that employs iterative and recursive feature deletion algorithms to see which words can be deleted from a pre-trained transformer's vocabulary with the least degradation to its performance, specifically for the task of sentiment analysis. Empirically, stopword lists generated via this approach drastically reduce dataset size while negligibly impacting model performance, in one such…
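As a rough illustration of the deletion idea described in the abstract, the Python sketch below greedily removes words and keeps a removal only if accuracy does not drop by more than a tolerance. It is not the author's implementation. The toy lexicon classifier `toy_predict`, the `tolerance` parameter, the block-halving strategy in `recursive_deletion`, and the four-sentence dataset are all assumptions made for illustration; the paper itself evaluates a pre-trained transformer.

```python
from typing import Callable, List, Optional, Sequence, Set, Tuple

Example = Tuple[List[str], int]                    # (tokens, label: 1 = positive, 0 = negative)
Predictor = Callable[[List[str]], Optional[int]]

# Hypothetical stand-in for a real sentiment model (assumption, not from the paper).
POSITIVE, NEGATIVE = {"great"}, {"awful"}

def toy_predict(tokens: List[str]) -> Optional[int]:
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score == 0:
        return None                                # no evidence left: abstain (scored as wrong)
    return 1 if score > 0 else 0

def accuracy(predict: Predictor, data: Sequence[Example], removed: Set[str]) -> float:
    """Accuracy of `predict` after deleting every word in `removed` from the inputs."""
    hits = 0
    for tokens, label in data:
        kept = [t for t in tokens if t not in removed]
        hits += int(predict(kept) == label)
    return hits / len(data)

def iterative_deletion(predict: Predictor, data: Sequence[Example],
                       candidates: Sequence[str], tolerance: float = 0.0) -> Set[str]:
    """Try candidates one at a time; keep a deletion if the accuracy drop is within tolerance."""
    baseline = accuracy(predict, data, set())
    stopwords: Set[str] = set()
    for word in candidates:
        trial = stopwords | {word}
        if baseline - accuracy(predict, data, trial) <= tolerance:
            stopwords = trial
    return stopwords

def recursive_deletion(predict: Predictor, data: Sequence[Example],
                       candidates: Sequence[str], tolerance: float = 0.0) -> Set[str]:
    """Try deleting whole blocks of candidates; split a block in half when deletion hurts."""
    baseline = accuracy(predict, data, set())
    stopwords: Set[str] = set()

    def visit(block: Sequence[str]) -> None:
        if not block:
            return
        trial = stopwords | set(block)
        if baseline - accuracy(predict, data, trial) <= tolerance:
            stopwords.update(block)                # the whole block is safe to delete
        elif len(block) > 1:
            mid = len(block) // 2
            visit(block[:mid])
            visit(block[mid:])

    visit(list(candidates))
    return stopwords

DATA: List[Example] = [
    (["the", "movie", "was", "great"], 1),
    (["the", "plot", "was", "awful"], 0),
    (["a", "great", "cast"], 1),
    (["an", "awful", "mess"], 0),
]
CANDIDATES = sorted({t for tokens, _ in DATA for t in tokens})

print(sorted(iterative_deletion(toy_predict, DATA, CANDIDATES)))
# ['a', 'an', 'cast', 'mess', 'movie', 'plot', 'the', 'was']
print(sorted(recursive_deletion(toy_predict, DATA, CANDIDATES)))
# ['a', 'an', 'cast', 'mess', 'movie', 'plot', 'the', 'was']
```

On this toy data both variants keep only the sentiment-bearing words "great" and "awful". The recursive variant can need fewer model evaluations when large blocks of words are removable at once, at the cost of depending on how the candidates are ordered.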
Taxonomy
Topics: Topic Modeling · Recommender Systems and Techniques · Sentiment Analysis and Opinion Mining
Methods: Logistic Regression
