Addressing Data Scarcity in Bangla Fake News Detection: An LLM-Based Dataset Augmentation Approach

Ahmed Alfey Sani; Kazi Akib Zaoad; Shefayat E Shams Adib; Md Abdul Muqtadir; Ajwad Abrar

arXiv:2605.01292 · cs.CL · May 5, 2026

TL;DR

This paper explores LLM-based data augmentation for Bangla fake news detection, reports an F1 gain from 0.85 to 0.88 when only the minority class is augmented, and releases a synthetic dataset for low-resource language research.

Contribution

The paper introduces a systematic LLM augmentation framework for Bangla fake news datasets that improves classification performance, and it supports reproducibility by publicly releasing the data and code.

Findings

1. Augmenting only the minority class at a high augmentation rate raises the F1 score from 0.85 to 0.88.

2. Random subsampling combined with augmentation yields the strongest gains.

3. The authors generated 4,545 synthetic Bangla fake news samples for research use.
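
The findings above combine minority-class augmentation with random subsampling. A minimal sketch of one way such a step could look is given here. The function name `augment_minority`, its parameters, the reading of "augmentation rate" as synthetic samples added per real minority sample, and the choice to subsample the synthetic pool at random are all assumptions made for illustration; the paper's own definitions are not stated on this page.

```python
import random
from typing import List, Tuple

Sample = Tuple[str, int]  # (article text, class label)


def augment_minority(
    real: List[Sample],
    synthetic_pool: List[str],
    minority_label: int,
    aug_rate: float,
    seed: int = 0,
) -> List[Sample]:
    """Append randomly subsampled synthetic texts to the minority class only.

    Assumption: aug_rate is the number of synthetic samples added per real
    minority sample (e.g. 2.0 adds twice the real minority count).
    """
    n_minority = sum(1 for _, label in real if label == minority_label)
    # Never request more synthetic samples than the pool contains.
    n_to_add = min(len(synthetic_pool), round(aug_rate * n_minority))
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    chosen = rng.sample(synthetic_pool, n_to_add)
    return real + [(text, minority_label) for text in chosen]


# Toy data: 8 majority samples (label 0) and 2 minority samples (label 1).
real_data = [("majority article", 0)] * 8 + [("minority article", 1)] * 2
pool = [f"synthetic minority article {i}" for i in range(10)]
train = augment_minority(real_data, pool, minority_label=1, aug_rate=2.0)
minority_count = sum(1 for _, label in train if label == 1)
```

With the toy data, four synthetic samples are added, so the minority class grows from 2 to 6 while the majority class is left untouched.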

Abstract

The growing spread of misinformation in digital media highlights the need for reliable fake news detection systems, yet progress in under-resourced languages such as Bangla is limited by small and imbalanced datasets. This study investigates whether Large Language Model (LLM)-based augmentation can effectively address this limitation and improve Bangla fake news classification. Existing datasets remain valuable but highly imbalanced, limiting model performance, and LLM-based augmentation for Bangla has been scarcely explored. To fill this gap, we propose a systematic augmentation framework that generates synthetic Bangla news articles using the instruction-tuned Gemma 3 27B IT model, supported by semantic filtering and controlled subsampling to preserve label consistency and diversity. We compare zero-shot and few-shot prompting, evaluate multiple augmentation rates, and examine random…
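
The abstract mentions semantic filtering of generated articles to preserve label consistency and diversity. The sketch below illustrates the general idea only. It assumes a similarity band: a synthetic text is kept if it is close enough to some reference text to stay on topic, but not so close that it is a near-duplicate. The token-level Jaccard score is a crude stand-in for an embedding-based similarity, and the names `token_jaccard`, `semantic_filter`, and the thresholds `low` and `high` are hypothetical, not taken from the paper.

```python
from typing import List


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of whitespace tokens (stand-in for embedding similarity)."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def semantic_filter(
    synthetic: List[str],
    references: List[str],
    low: float = 0.1,   # hypothetical floor: below this the text is off-topic
    high: float = 0.8,  # hypothetical ceiling: above this it is a near-duplicate
) -> List[str]:
    """Keep synthetic texts whose best similarity to any reference is in [low, high]."""
    kept = []
    for text in synthetic:
        best = max((token_jaccard(text, ref) for ref in references), default=0.0)
        if low <= best <= high:
            kept.append(text)
    return kept


references = ["the minister announced a new flood relief fund"]
candidates = [
    "the minister announced a new flood relief fund",         # exact copy
    "the minister secretly cancelled the flood relief fund",  # related variant
    "cricket team wins the series",                           # unrelated topic
]
kept = semantic_filter(candidates, references)
```

In this toy run only the related variant survives: the exact copy scores 1.0 and the unrelated sentence scores below the floor. A real pipeline would likely swap `token_jaccard` for a sentence-embedding similarity and tune the thresholds on held-out data.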
