Using Machine Learning to Detect Fraudulent SMSs in Chichewa
Amelia Taylor, Amoss Robert

TL;DR
This study introduces the first dataset for SMS fraud detection in Chichewa. Machine learning models achieve high accuracy on the Chichewa data, but performance drops on translated versions of the dataset, which underlines the need for language-specific models.
Contribution
The paper creates and evaluates the first Chichewa SMS fraud dataset and assesses the feasibility of machine learning models for this language, highlighting challenges in multilingual NLP.
Findings
Models achieved over 96% accuracy on Chichewa data
Performance declined when using translated datasets
Data preprocessing impacts model effectiveness in multilingual settings
Abstract
SMS-enabled fraud is of great concern globally. Building machine learning classifiers for SMS fraud requires suitable datasets for model training and validation. Most research has centred on datasets of SMSs in English. This paper introduces a first dataset for SMS fraud detection in Chichewa, a major language in Africa, and reports on experiments with machine learning algorithms for classifying SMSs in Chichewa as fraud or non-fraud. We answer the broader research question of how feasible it is to develop machine learning classification models for Chichewa SMSs. To do that, we created three datasets. A small dataset of SMSs in Chichewa was collected through primary research from a segment of the young population. We applied label-preserving text transformations to increase its size. The enlarged dataset was translated into English using two approaches:…
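The abstract does not say which label-preserving transformations were applied, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes two simple word-level transformations (swapping adjacent words and dropping one word) and assumes that these leave the fraud/non-fraud label unchanged. The function names and the placeholder messages are hypothetical.

```python
import random


def swap_adjacent_words(text, rng):
    """Swap one randomly chosen pair of adjacent words (assumed label-preserving)."""
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def drop_one_word(text, rng):
    """Delete one randomly chosen word, always keeping at least one word."""
    words = text.split()
    if len(words) < 2:
        return text
    del words[rng.randrange(len(words))]
    return " ".join(words)


def augment(dataset, copies=2, seed=0):
    """Return the original (text, label) pairs plus `copies` variants of each.

    Every variant inherits the label of the message it was derived from.
    """
    rng = random.Random(seed)  # seeded so the enlarged dataset is reproducible
    transforms = [swap_adjacent_words, drop_one_word]
    enlarged = list(dataset)
    for text, label in dataset:
        for _ in range(copies):
            transform = rng.choice(transforms)
            enlarged.append((transform(text, rng), label))
    return enlarged


if __name__ == "__main__":
    # Placeholder tokens, not real Chichewa messages.
    seed_data = [
        ("token1 token2 token3 token4", "fraud"),
        ("token5 token6 token7", "non-fraud"),
    ]
    for text, label in augment(seed_data):
        print(label, "|", text)
```

With `copies=2`, each collected message yields two extra variants, so the dataset triples in size while the class balance stays the same. Whether a given transformation really preserves the label has to be checked against the language and the fraud patterns in question.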
Taxonomy
Topics: Spam and Phishing Detection · Cybercrime and Law Enforcement Studies
