SOLD: Sinhala Offensive Language Dataset

Tharindu Ranasinghe; Isuri Anuradha; Damith Premasiri; Kanishka Silva; Hansi Hettiarachchi; Lasitha Uyangodage; Marcos Zampieri

arXiv:2212.00851 · cs.CL · March 29, 2024 · 5 cites

TL;DR

This paper introduces SOLD, the first large publicly available dataset for offensive language detection in Sinhala, annotated at both the sentence level and the token level, and presents experiments that demonstrate its utility for NLP tasks.

Contribution

It provides the first comprehensive offensive language dataset for Sinhala, together with a larger semi-supervised dataset, advancing research on offensive content detection in low-resource languages.

Findings

01 SOLD contains 10,000 posts annotated for offensive language detection.

02 SemiSOLD contains over 145,000 tweets labelled with a semi-supervised approach.

03 The datasets support model explainability and facilitate NLP research in Sinhala.

Abstract

The spread of offensive content online, such as hate speech and cyber-bullying, is a global phenomenon. This has sparked interest in the artificial intelligence (AI) and natural language processing (NLP) communities, motivating the development of various systems trained to detect potentially harmful content automatically. These systems require annotated datasets to train machine learning (ML) models. However, with a few notable exceptions, most datasets on this topic cover English and a few other high-resource languages. As a result, research in offensive language identification has been limited to these languages. This paper addresses this gap by tackling offensive language identification in Sinhala, a low-resource Indo-Aryan language spoken by over 17 million people in Sri Lanka. We introduce the Sinhala Offensive Language Dataset (SOLD) and present multiple…

Code & Models

Repositories

sinhala-nlp/sold (PyTorch, official)

Taxonomy

Topics: Hate Speech and Cyberbullying Detection