Multi-Stage Training for Abusive Comment Detection in Indic Languages

Pranshu Rastogi; Madhav Mathur; Ramaneswaran S; Kshitij Mohan

arXiv:2605.22380·cs.CL·May 22, 2026


TL;DR

This paper presents a multi-stage training pipeline that combines language-based preprocessing with an ensemble of models to detect abusive comments in Indic languages. The goal is to reduce false positives (non-abusive comments flagged as abusive) so that detection does not undermine freedom of expression.

Contribution

It introduces a novel multi-stage training approach with language preprocessing and ensemble modeling tailored for Indic languages in abusive comment detection.

Findings

01 The proposed pipeline reduces false-positive rates significantly.

02 Ensemble models outperform individual classifiers in accuracy.

03 Extensive experiments validate the effectiveness of the approach.

Abstract

In recent years, social media has become an increasingly popular tool for communication. People use it to share ideas, exchange information, and discuss their thoughts. Given its prevalence and widespread reach, social media must remain a safe space for people. Content generated on social media can be abusive, and it has become increasingly important to detect such content. In this paper, we use language-based preprocessing and an ensemble of several models, and analyze their performance on abusive comment detection. Through extensive experimentation, we propose a pipeline that minimizes the false-positive rate (marking non-abusive comments as abusive) so that these systems can detect abusive comments without undermining freedom of expression.
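
The abstract names two ingredients, an ensemble of models and a low false-positive rate, without giving details here. The following is a minimal sketch of how the two could fit together, not the authors' implementation. It assumes soft voting (averaging each model's abusive probability) and a decision threshold tuned on validation data. The function names `soft_vote`, `false_positive_rate`, and `pick_threshold`, and the toy scores, are hypothetical.

```python
from typing import List, Sequence


def soft_vote(model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-comment abusive probabilities across models (soft voting)."""
    n_models = len(model_probs)
    return [sum(col) / n_models for col in zip(*model_probs)]


def false_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """FPR = FP / (FP + TN), with label 1 = abusive and 0 = non-abusive."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else 0.0


def pick_threshold(y_true: Sequence[int], scores: Sequence[float],
                   max_fpr: float) -> float:
    """Return the lowest threshold whose validation FPR is <= max_fpr.

    A comment is flagged as abusive when its score is >= the threshold.
    Raising the threshold can only lower the FPR, so the first candidate
    that satisfies the cap is the lowest one that does.
    """
    candidates = sorted(set(scores)) + [float("inf")]
    for thr in candidates:
        preds = [1 if s >= thr else 0 for s in scores]
        if false_positive_rate(y_true, preds) <= max_fpr:
            return thr
    return float("inf")  # only reached if max_fpr is negative


# Toy validation data (made up for illustration): two models, five comments.
model_a = [0.75, 0.25, 0.5, 0.125, 0.5]
model_b = [1.0, 0.25, 0.75, 0.375, 0.75]
labels = [1, 0, 1, 0, 0]

ensemble_scores = soft_vote([model_a, model_b])
print(pick_threshold(labels, ensemble_scores, max_fpr=0.34))  # 0.625
```

A tighter cap trades recall for fewer wrongly flagged comments. With `max_fpr=0.0` on the same toy data the threshold rises to 0.875, and the abusive comment scored 0.625 is no longer caught.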
