Overcoming Low-Resource Barriers in Tulu: Neural Models and Corpus Creation for Offensive Language Identification

Anusha M D; Deepthi Vikram; Bharathi Raja Chakravarthi; Parameshwar R Hegde

arXiv:2508.11166 · cs.CL · August 18, 2025

TL;DR

This paper introduces the first benchmark dataset for offensive language detection in low-resource Tulu social media content, evaluates deep learning models, and highlights challenges faced by transformer models in such contexts.

Contribution

The paper provides a new annotated dataset for Tulu offensive language identification and benchmarks recurrent, convolutional, attention-based, and transformer models on it, showing that a BiGRU with self-attention outperforms mBERT and XLM-RoBERTa on this low-resource, code-mixed data.

Findings

01. BiGRU with self-attention achieves 82% accuracy

02. Transformer models underperform in code-mixed Tulu

03. High inter-annotator agreement (Krippendorff's alpha = 0.984)
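
For readers unfamiliar with the agreement statistic cited above, the following is a minimal sketch of Krippendorff's alpha for nominal labels, built from the standard coincidence-matrix definition. The function name `krippendorff_alpha_nominal` and the toy annotations are assumptions made for illustration; they are not the authors' code or data, and the paper's exact annotation setup (number of annotators, handling of missing labels) is not stated here.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists: one inner list per comment, holding the
    labels assigned by the annotators who rated it (missing ratings omitted).
    Hypothetical helper written for illustration only.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single rating carries no agreement information
        # Each ordered pair of ratings from different annotators adds 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label, and the total number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # undefined when only one label is ever used
    return 1.0 - observed / expected


# Toy annotations (invented): two annotators, four comments, one disagreement.
toy_units = [
    ["Not Offensive", "Not Offensive"],
    ["Not Offensive", "Offensive Targeted"],
    ["Offensive Targeted", "Offensive Targeted"],
    ["Offensive Targeted", "Offensive Targeted"],
]
toy_alpha = krippendorff_alpha_nominal(toy_units)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so the reported 0.984 indicates that annotators almost never disagreed.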

Abstract

Tulu, a low-resource Dravidian language predominantly spoken in southern India, has limited computational resources despite its growing digital presence. This study presents the first benchmark dataset for Offensive Language Identification (OLI) in code-mixed Tulu social media content, collected from YouTube comments across various domains. The dataset, annotated with high inter-annotator agreement (Krippendorff's alpha = 0.984), includes 3,845 comments categorized into four classes: Not Offensive, Not Tulu, Offensive Untargeted, and Offensive Targeted. We evaluate a suite of deep learning models, including GRU, LSTM, BiGRU, BiLSTM, CNN, and attention-based variants, alongside transformer architectures (mBERT, XLM-RoBERTa). The BiGRU model with self-attention achieves the best performance with 82% accuracy and a 0.81 macro F1-score. Transformer models underperform, highlighting the…
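
The abstract reports both accuracy and macro F1. As a minimal sketch of how the two differ on a four-class problem, the snippet below computes them from scratch. The function name `accuracy_and_macro_f1` and the toy predictions are assumptions made for illustration and do not come from the paper; only the four class names are taken from the abstract.

```python
def accuracy_and_macro_f1(y_true, y_pred, labels):
    """Return (accuracy, macro F1) for parallel lists of gold and predicted labels.

    Hypothetical helper written for illustration only.
    """
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Macro averaging weights every class equally, regardless of its frequency.
    macro_f1 = sum(f1_scores) / len(labels)
    return accuracy, macro_f1


# The four classes named in the abstract.
CLASSES = [
    "Not Offensive",
    "Not Tulu",
    "Offensive Untargeted",
    "Offensive Targeted",
]

# Toy gold labels and predictions (invented, not from the dataset).
gold = ["Not Offensive"] * 4 + ["Not Tulu"] * 2 + [
    "Offensive Untargeted",
    "Offensive Targeted",
]
pred = ["Not Offensive"] * 4 + ["Not Tulu", "Not Offensive"] + [
    "Offensive Untargeted",
    "Offensive Untargeted",
]
toy_accuracy, toy_macro_f1 = accuracy_and_macro_f1(gold, pred, CLASSES)
```

In this toy case the classifier gets six of eight comments right, yet macro F1 is much lower because one minority class is never predicted correctly. That sensitivity to small classes is why macro F1 is a useful companion to accuracy when class sizes are uneven.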
