SOLD: Sinhala Offensive Language Dataset
Tharindu Ranasinghe, Isuri Anuradha, Damith Premasiri, Kanishka Silva, Hansi Hettiarachchi, Lasitha Uyangodage, Marcos Zampieri

TL;DR
This paper introduces SOLD, the first large publicly available dataset for offensive language detection in Sinhala, annotated at both the sentence level and the token level, and presents experiments demonstrating its utility for NLP tasks.
Contribution
It provides the first comprehensive Sinhala offensive language dataset, together with a larger dataset annotated by semi-supervised methods, advancing research on offensive content detection in low-resource languages.
Findings
SOLD contains 10,000 annotated posts for offensive language detection.
SemiSOLD includes over 145,000 tweets annotated with a semi-supervised approach.
The token-level annotations support model explainability, and both datasets facilitate NLP research in Sinhala.
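As a rough illustration of what combined sentence-level and token-level annotation can look like, the sketch below models a single post in Python. The class name, field names, and the "OFF"/"NOT" label values are assumptions made for this example and are not taken from the released dataset; the placeholder tokens stand in for real text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AnnotatedPost:
    """Hypothetical record: one sentence-level label plus one flag per token."""
    tokens: List[str]
    label: str             # sentence level; "OFF" / "NOT" are assumed label names
    rationales: List[int]  # token level; 1 marks a token that makes the post offensive

    def __post_init__(self) -> None:
        # Every token needs exactly one rationale flag.
        if len(self.tokens) != len(self.rationales):
            raise ValueError("need exactly one rationale flag per token")
        # A post labelled non-offensive should not flag any token.
        if self.label == "NOT" and any(self.rationales):
            raise ValueError("a non-offensive post should have no flagged tokens")

    def offensive_tokens(self) -> List[str]:
        """Return the tokens flagged as offensive, in their original order."""
        return [t for t, r in zip(self.tokens, self.rationales) if r == 1]


# Placeholder example, not real data.
post = AnnotatedPost(tokens=["you", "are", "<insult>"], label="OFF", rationales=[0, 0, 1])
print(post.offensive_tokens())
```

Keeping the per-token flags next to the sentence label is one way a model's token-level predictions could be compared against human rationales, which is the sense in which such annotations can help explainability.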
Abstract
The spread of offensive content online, such as hate speech and cyber-bullying, is a global phenomenon. This has sparked interest in the artificial intelligence (AI) and natural language processing (NLP) communities, motivating the development of various systems trained to detect potentially harmful content automatically. These systems require annotated datasets to train the machine learning (ML) models. However, with a few notable exceptions, most datasets on this topic have dealt with English and a few other high-resource languages. As a result, research in offensive language identification has been limited to these languages. This paper addresses this gap by tackling offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka. We introduce the Sinhala Offensive Language Dataset (SOLD) and present multiple…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection
