IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language

Lucky Susanto; Musa Izzanardi Wijanarko; Prasetia Anugrah Pratama; Traci Hong; Ika Idris; Alham Fikri Aji; Derry Wijaya

arXiv:2406.19349 · cs.CL · June 13, 2025

TL;DR

This paper introduces IndoToxic2024, a large annotated dataset of Indonesian hate speech targeting vulnerable groups. It reports baseline classification results and examines how demographic information affects model performance.

Contribution

The creation of IndoToxic2024, a comprehensive, demographically-enriched Indonesian hate speech dataset, and the evaluation of baseline models and demographic information integration.

Findings

01

Achieved a macro-F1 score of 0.78 with IndoBERTweet on hate speech classification.

02

Demographic information improves zero-shot performance of gpt-3.5-turbo.

03

Overemphasis on demographic data can reduce fine-tuned model performance.
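
For readers unfamiliar with the metric in the first finding, the following is a minimal sketch of how macro-F1 is computed: a per-class F1 score averaged with equal weight per class, so a rare class counts as much as a frequent one. The function name `macro_f1` and the toy labels are illustrative assumptions and do not come from the paper or its evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with made-up labels: 1 = hate speech, 0 = not hate speech.
y_true = [1, 0, 0, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
print(round(macro_f1(y_true, y_pred), 3))
```

In this toy case accuracy is 7/8, but the missed positive lowers the minority-class F1 to 2/3, which pulls the macro average down. That sensitivity to minority-class errors is why macro-F1 is a common choice for imbalanced tasks such as hate speech detection.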

Abstract

Hate speech poses a significant threat to social harmony. Over the past two years, Indonesia has seen a ten-fold increase in the online hate speech ratio, underscoring the urgent need for effective detection mechanisms. However, progress is hindered by the limited availability of labeled data for Indonesian texts. The condition is even worse for marginalized minorities, such as Shia, LGBTQ, and other ethnic minorities because hate speech is underreported and less understood by detection tools. Furthermore, the lack of accommodation for subjectivity in current datasets compounds this issue. To address this, we introduce IndoToxic2024, a comprehensive Indonesian hate speech and toxicity classification dataset. Comprising 43,692 entries annotated by 19 diverse individuals, the dataset focuses on texts targeting vulnerable groups in Indonesia, specifically during the hottest political event…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Linguistics and Language Analysis

Methods: Attention Is All You Need · WordPiece · Residual Connection · Weight Decay · Softmax · Layer Normalization · Attention Dropout · Linear Warmup With Linear Decay · Dropout