Corpora Preparation and Stopword List Generation for Arabic data in Social Network
Walaa Medhat, Ahmed H. Yousef, Hoda Korashy

TL;DR
This paper develops a methodology for preparing Arabic social media corpora and generating dialect-specific stopword lists, demonstrating improved sentiment analysis performance when using dialect-aware stopword removal.
Contribution
It introduces a novel approach for creating Egyptian dialect stopword lists from social media data and evaluates their impact on sentiment analysis accuracy.
Findings
Dialect-specific stopword lists improve classification performance
Combining MSA and dialect stopwords yields better results
Unigram features outperform bigram in classification accuracy
Abstract
This paper proposes a methodology to prepare corpora in Arabic language from online social network (OSN) and review site for Sentiment Analysis (SA) task. The paper also proposes a methodology for generating a stopword list from the prepared corpora. The aim of the paper is to investigate the effect of removing stopwords on the SA task. The problem is that the stopwords lists generated before were on Modern Standard Arabic (MSA) which is not the common language used in OSN. We have generated a stopword list of Egyptian dialect and a corpus-based list to be used with the OSN corpora. We compare the efficiency of text classification when using the generated lists along with previously generated lists of MSA and combining the Egyptian dialect list with the MSA list. The text classification was performed using Naïve Bayes and Decision Tree classifiers and two feature selection approaches,…
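As a rough illustration of what a corpus-based stopword list can look like, the sketch below ranks whitespace-separated tokens by frequency and keeps those that occur in a large share of documents. This is not the authors' procedure, which the truncated abstract does not spell out; the function names `candidate_stopwords` and `remove_stopwords`, the `top_n` and `min_doc_ratio` parameters, and the frequency heuristic itself are all assumptions made for illustration.

```python
from collections import Counter


def candidate_stopwords(documents, top_n=50, min_doc_ratio=0.5):
    """Return up to top_n frequent tokens that occur in at least
    min_doc_ratio of the documents (hypothetical heuristic, not the paper's)."""
    term_freq = Counter()  # total occurrences of each token
    doc_freq = Counter()   # number of documents containing each token
    for doc in documents:
        tokens = doc.split()  # naive whitespace tokenization (assumption)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    min_docs = min_doc_ratio * len(documents)
    frequent = [
        token
        for token, _ in term_freq.most_common()
        if doc_freq[token] >= min_docs
    ]
    return frequent[:top_n]


def remove_stopwords(text, stopwords):
    """Drop every token of text that appears in the stopword list."""
    stop = set(stopwords)
    return " ".join(token for token in text.split() if token not in stop)
```

In practice such a list would likely be reviewed by hand before use, since frequent tokens in a sentiment corpus can include opinion-bearing words that should not be removed.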
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Text and Document Classification Technologies · Spam and Phishing Detection
