Zero-Shot to Zero-Lies: Detecting Bengali Deepfake Audio through Transfer Learning
Most. Sharmin Sultana Samu, Md. Rakibul Islam, Md. Zahid Hossain, Md. Kamrozzaman Bhuiyan, Farhad Uz Zaman

TL;DR
This paper evaluates zero-shot and fine-tuned deep learning models for detecting Bengali deepfake audio, demonstrating that fine-tuning substantially improves detection accuracy in a low-resource language setting.
Contribution
It provides the first systematic benchmark of Bengali deepfake audio detection and compares zero-shot and fine-tuned models, highlighting the effectiveness of transfer learning.
Findings
Zero-shot models show limited detection ability: the best, Wav2Vec2-XLSR-53, reaches only 53.80% accuracy.
Fine-tuned models substantially outperform zero-shot inference, with ResNet18 achieving the highest accuracy at 79.17%.
Fine-tuning enhances deepfake detection performance in Bengali, a low-resource language.
Abstract
The rapid growth of speech synthesis and voice conversion systems has made deepfake audio a major security concern. Bengali deepfake detection remains largely unexplored. In this work, we study automatic detection of Bengali audio deepfakes using the BanglaFake dataset. We evaluate zero-shot inference with several pretrained models. These include Wav2Vec2-XLSR-53, Whisper, PANNsCNN14, WavLM and Audio Spectrogram Transformer. Zero-shot results show limited detection ability. The best model, Wav2Vec2-XLSR-53, achieves 53.80% accuracy, 56.60% AUC and 46.20% EER. We then fine-tune multiple architectures for Bengali deepfake detection. These include Wav2Vec2-Base, LCNN, LCNN-Attention, ResNet18, ViT-B16 and CNN-BiLSTM. Fine-tuned models show strong performance gains. ResNet18 achieves the highest accuracy of 79.17%, F1 score of 79.12%, AUC of 84.37% and EER of 24.35%. Experimental results…
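The abstract reports AUC and EER (equal error rate) alongside accuracy. As a rough illustration of how these two metrics can be derived from per-clip detector scores, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the function names (roc_points, auc, eer), the label convention (1 = fake, 0 = real) and the toy scores are all assumptions made for this example.

```python
def roc_points(labels, scores):
    """Sweep a threshold over the scores and return (fpr, tpr) lists.

    Assumed convention: label 1 = fake (positive), 0 = real,
    and a higher score means "more likely fake".
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one fake and one real sample")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(order):
        s = scores[order[i]]
        # Samples with identical scores cross the threshold together.
        while i < len(order) and scores[order[i]] == s:
            if labels[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for k in range(1, len(fpr)):
        area += (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
    return area


def eer(fpr, tpr):
    """Equal error rate: the point where FPR equals FNR (= 1 - TPR).

    Linearly interpolates between the two ROC points that bracket
    the crossing.
    """
    for k in range(1, len(fpr)):
        d0 = fpr[k - 1] - (1.0 - tpr[k - 1])
        d1 = fpr[k] - (1.0 - tpr[k])
        if d1 >= 0.0:
            t = -d0 / (d1 - d0)
            return fpr[k - 1] + t * (fpr[k] - fpr[k - 1])
    return 1.0  # not reached: the last ROC point always has d1 = 1


# Hypothetical toy scores for eight clips (not data from the paper).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]

fpr, tpr = roc_points(labels, scores)
print(f"AUC = {auc(fpr, tpr):.3f}")  # AUC = 0.875
print(f"EER = {eer(fpr, tpr):.3f}")  # EER = 0.250
```

Under this reading, a detector no better than chance would score about 50% on both AUC and EER, which gives a sense of how close the reported zero-shot figures (56.60% AUC, 46.20% EER) are to that baseline and how far the fine-tuned ResNet18 figures (84.37% AUC, 24.35% EER) move from it.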
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
