RideKE: Leveraging Low-Resource, User-Generated Twitter Content for Sentiment and Emotion Detection in Kenyan Code-Switched Dataset
Naome A. Etori, Maria L. Gini

TL;DR
This paper evaluates transformer-based models for sentiment and emotion detection in low-resource, code-switched Kenyan Twitter data, highlighting the effectiveness of XLM-R and DistilBERT in this challenging context.
Contribution
It introduces a methodology for collecting and annotating Kenyan code-switched Twitter data and compares multiple models, demonstrating the superior performance of XLM-R and DistilBERT in low-resource settings.
Findings
XLM-R outperforms other models in sentiment analysis.
DistilBERT achieves the best emotion classification accuracy.
All models tend to predict neutral sentiment, with AfriBERT showing the highest bias toward it.
Abstract
Social media has become a crucial open-access platform for individuals to express opinions and share experiences. However, leveraging low-resource language data from Twitter is challenging due to scarce, poor-quality content and major variations in language use, such as slang and code-switching. Identifying tweets in these languages can be difficult because Twitter primarily supports high-resource languages. We analyze Kenyan code-switched data and evaluate four state-of-the-art (SOTA) transformer-based pretrained models for sentiment and emotion classification, using supervised and semi-supervised methods. We detail the methodology behind data collection and annotation, and the challenges encountered during the data curation phase. Our results show that XLM-R outperforms other models; for sentiment analysis, the supervised XLM-R model achieves the highest accuracy (69.2%) and F1 score…
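The abstract reports both accuracy and F1, and the findings note a tendency to predict the neutral class. A minimal sketch of why the two metrics can diverge in that situation is given below. Everything in it is an assumption for illustration: the three-way label set, the toy `gold` and `pred` lists, and the choice of macro-averaged F1 (the summary does not say which F1 averaging the authors used). It is not the paper's evaluation code.

```python
from collections import Counter

# Assumed three-way sentiment label set; the paper's exact labels may differ.
LABELS = ("negative", "neutral", "positive")


def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: a classifier that leans heavily toward "neutral".
gold = ["negative", "neutral", "positive", "neutral", "negative", "positive"]
pred = ["neutral", "neutral", "positive", "neutral", "neutral", "neutral"]

print("accuracy:", accuracy(gold, pred))
print("macro F1:", macro_f1(gold, pred))
print("predicted label counts:", Counter(pred))
```

On this toy data the classifier never predicts "negative", so that class contributes an F1 of zero and pulls the macro average below the accuracy. Comparing the two numbers alongside the predicted label counts is one simple way to spot a neutral-class bias of the kind the findings describe.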
