Using Machine Learning to Enhance the Detection of Obfuscated Abusive Words in Swahili: A Focus on Child Safety
Phyllis Nabangi, Abdul-Jalil Zakaria, Jema David Ndibwile

TL;DR
This paper explores machine learning techniques to detect obfuscated abusive language in Swahili online content, aiming to improve child safety by addressing linguistic challenges in a low-resource language.
Contribution
The paper applies ML models, namely Support Vector Machines (SVM), Logistic Regression, and Decision Trees, to Swahili abuse detection, with emphasis on data balancing and model optimization.
Findings
Models perform well with high-dimensional data
Data imbalance limits generalizability
Performance varies across models
Abstract
The rise of digital technology has dramatically increased the potential for cyberbullying and online abuse, necessitating enhanced measures for detection and prevention, especially among children. This study focuses on detecting obfuscated abusive language in Swahili, a low-resource language that poses unique challenges due to its limited linguistic resources and technological support. Swahili was chosen because it is the most widely spoken language in Africa, with over 16 million native speakers and upwards of 100 million speakers in total, spanning East Africa and some parts of the Middle East. We employed machine learning models including Support Vector Machines (SVM), Logistic Regression, and Decision Trees, optimized through rigorous parameter tuning and techniques like Synthetic Minority Over-sampling Technique (SMOTE) to handle data imbalance. Our…
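The abstract names SMOTE as the technique used to handle data imbalance. As a rough illustration of the idea only, and not the authors' implementation, the sketch below creates synthetic minority-class points by interpolating between a minority sample and one of its nearest minority neighbours. The function name `smote`, its parameters, and the toy feature vectors are all assumptions made for this example; it also picks base samples at random rather than cycling through every minority sample, which is a simplification. A real pipeline would more likely rely on an established library implementation.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours (Euclidean distance).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        # Place the new point somewhere on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy vectors standing in for vectorised abusive-class text (made-up values).
abusive = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
new_points = smote(abusive, n_new=4)
balanced_minority = abusive + new_points
```

Because every synthetic point lies on a segment between two real minority samples, oversampling of this kind adds no new information about the minority class; it only reduces the bias a classifier would otherwise have toward the majority class. It is normally applied to the training split only, so that evaluation data stays untouched.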
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Authorship Attribution and Profiling · Bullying, Victimization, and Aggression
