BanglaNirTox: A Large-scale Parallel Corpus for Explainable AI in Bengali Text Detoxification

Ayesha Afroza Mohsin; Mashrur Ahsan; Nafisa Maliyat; Shanta Maria; Syed Rifat Raiyan; Hasan Mahmud; Md Kamrul Hasan

arXiv:2511.01512 · cs.CL · November 4, 2025

TL;DR

This paper introduces BanglaNirTox, a large-scale parallel corpus for Bengali text detoxification, and demonstrates how Pareto-optimized LLMs with Chain-of-Thought prompting improve detoxification quality.

Contribution

It presents a novel pipeline combining Pareto-optimized LLMs and CoT prompting, along with a new dataset for Bengali text detoxification, addressing resource scarcity in low-resource language detoxification.

Findings

01. Pareto-optimized LLMs with CoT improve detoxification quality

02. The BanglaNirTox dataset contains 68,041 toxic sentences with labels and detoxified paraphrases

03. Enhanced consistency in Bengali text detoxification results

Abstract

Toxic language in Bengali remains prevalent, especially in online environments, with few effective safeguards against it. Although text detoxification has seen progress in high-resource languages, Bengali remains underexplored due to limited resources. In this paper, we propose a novel pipeline for Bengali text detoxification that combines Pareto class-optimized large language models (LLMs) and Chain-of-Thought (CoT) prompting to generate detoxified sentences. To support this effort, we construct BanglaNirTox, an artificially generated parallel corpus of 68,041 toxic Bengali sentences with class-wise toxicity labels, reasoning, and detoxified paraphrases, using Pareto-optimized LLMs evaluated on random samples. The resulting BanglaNirTox dataset is used to fine-tune language models to produce better detoxified versions of Bengali sentences. Our findings show that Pareto-optimized LLMs…
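The abstract refers to Pareto-optimized LLMs without spelling out the selection step on this page. As a rough illustration of the general idea only, the sketch below keeps the candidates that no other candidate dominates across several quality metrics. The model names, the three metrics, and all scores are hypothetical placeholders, and this is a generic Pareto-dominance filter, not necessarily the procedure the authors used.

```python
from typing import Dict, List, Tuple

# Hypothetical per-model scores; every metric is "higher is better".
# The metric choice (style accuracy, content preservation, fluency) is an
# assumption made for illustration and is not taken from the paper.
Scores = Tuple[float, ...]


def dominates(a: Scores, b: Scores) -> bool:
    """Return True if a is at least as good as b everywhere and strictly better somewhere."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Return the sorted names of candidates that no other candidate dominates."""
    front = []
    for name, scores in candidates.items():
        dominated = any(
            dominates(other_scores, scores)
            for other_name, other_scores in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Placeholder scores for three made-up candidate models.
example_scores = {
    "model_a": (0.90, 0.70, 0.80),
    "model_b": (0.85, 0.80, 0.75),
    "model_c": (0.80, 0.65, 0.70),
}

front = pareto_front(example_scores)
print(front)
```

In this toy input, `model_c` is worse than `model_a` on every metric and is dropped, while `model_a` and `model_b` each win on a different metric and both stay on the front. A class-wise variant would presumably repeat this filter once per toxicity class, which is again an assumption rather than something stated here.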

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Text Readability and Simplification