BanglaNirTox: A Large-scale Parallel Corpus for Explainable AI in Bengali Text Detoxification
Ayesha Afroza Mohsin, Mashrur Ahsan, Nafisa Maliyat, Shanta Maria, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan

TL;DR
This paper introduces BanglaNirTox, a large-scale parallel corpus for Bengali text detoxification, and demonstrates how Pareto-optimized LLMs with Chain-of-Thought prompting improve detoxification quality.
Contribution
The paper presents a novel pipeline that combines Pareto-optimized LLMs with CoT prompting, together with a new parallel dataset, to address the scarcity of resources for Bengali text detoxification.
Findings
Pareto-optimized LLMs with CoT improve detoxification quality
The BanglaNirTox dataset contains 68,041 toxic sentences with labels and detoxified paraphrases
Enhanced consistency in Bengali text detoxification results
Abstract
Toxic language in Bengali remains prevalent, especially in online environments, with few effective precautions against it. Although text detoxification has seen progress in high-resource languages, Bengali remains underexplored due to limited resources. In this paper, we propose a novel pipeline for Bengali text detoxification that combines Pareto class-optimized large language models (LLMs) and Chain-of-Thought (CoT) prompting to generate detoxified sentences. To support this effort, we construct BanglaNirTox, an artificially generated parallel corpus of 68,041 toxic Bengali sentences with class-wise toxicity labels, reasonings, and detoxified paraphrases, using Pareto-optimized LLMs evaluated on random samples. The resulting BanglaNirTox dataset is used to fine-tune language models to produce better detoxified versions of Bengali sentences. Our findings show that Pareto-optimized LLMs…
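The abstract does not say which metrics or models drive the Pareto optimization, so the following is only a minimal sketch of the general idea: a candidate is kept when no other candidate is at least as good on every metric and strictly better on one. The model names, the three score dimensions, and every number below are hypothetical placeholders, not values from the paper.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, ...]


def dominates(a: Scores, b: Scores) -> bool:
    """Return True if `a` Pareto-dominates `b` (higher is better on every metric)."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(scores: Dict[str, Scores]) -> List[str]:
    """Return the names of all candidates that no other candidate dominates."""
    front = []
    for name, own in scores.items():
        dominated = any(
            dominates(other_scores, own)
            for other, other_scores in scores.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical scores per candidate LLM, e.g. measured on a random sample:
# (toxicity reduction, meaning preservation, fluency). All values are invented.
candidate_scores = {
    "model_a": (0.90, 0.70, 0.80),
    "model_b": (0.85, 0.75, 0.82),
    "model_c": (0.80, 0.65, 0.78),
    "model_d": (0.90, 0.70, 0.75),
}

front = pareto_front(candidate_scores)
print(front)
```

With these placeholder numbers, `model_c` and `model_d` are both dominated by `model_a`, so only `model_a` and `model_b` remain on the front. If the selection is done per toxicity class, as "class-optimized" suggests, the same function could be applied once per class.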
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Topic Modeling · Text Readability and Simplification
