Zero-Shot to Zero-Lies: Detecting Bengali Deepfake Audio through Transfer Learning

Most. Sharmin Sultana Samu; Md. Rakibul Islam; Md. Zahid Hossain; Md. Kamrozzaman Bhuiyan; Farhad Uz Zaman

arXiv:2512.21702·cs.SD·December 29, 2025

Open Access

TL;DR

This paper evaluates zero-shot and fine-tuned deep learning models for detecting Bengali deepfake audio, demonstrating that fine-tuning substantially improves detection accuracy in a low-resource language setting.

Contribution

It provides the first systematic benchmark of Bengali deepfake audio detection and compares zero-shot and fine-tuned models, highlighting the effectiveness of transfer learning.

Findings

01. Zero-shot models have limited detection ability: the best, Wav2Vec2-XLSR-53, reaches only 53.80% accuracy.

02. Fine-tuned models significantly outperform zero-shot inference, with ResNet18 achieving the highest accuracy at 79.17%.

03. Fine-tuning enhances deepfake detection performance in Bengali, a low-resource language.

Abstract

The rapid growth of speech synthesis and voice conversion systems has made deepfake audio a major security concern. Bengali deepfake detection remains largely unexplored. In this work, we study automatic detection of Bengali audio deepfakes using the BanglaFake dataset. We evaluate zero-shot inference with several pretrained models. These include Wav2Vec2-XLSR-53, Whisper, PANNsCNN14, WavLM and Audio Spectrogram Transformer. Zero-shot results show limited detection ability. The best model, Wav2Vec2-XLSR-53, achieves 53.80% accuracy, 56.60% AUC and 46.20% EER. We then fine-tune multiple architectures for Bengali deepfake detection. These include Wav2Vec2-Base, LCNN, LCNN-Attention, ResNet18, ViT-B16 and CNN-BiLSTM. Fine-tuned models show strong performance gains. ResNet18 achieves the highest accuracy of 79.17%, F1 score of 79.12%, AUC of 84.37% and EER of 24.35%. Experimental results…
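
The abstract reports equal error rate (EER) alongside accuracy and AUC. As a rough illustration of what that metric measures, the sketch below computes EER from per-clip detector scores: it sweeps a decision threshold and reports the point where the false acceptance rate and false rejection rate are closest. This is not the authors' evaluation code. The function name `compute_eer`, the label convention (1 = fake, 0 = real), and the assumption that a higher score means "more likely fake" are all hypothetical choices made for this example.

```python
import numpy as np

def compute_eer(labels, scores):
    """Approximate equal error rate for a binary detector.

    Assumed convention (not from the paper): label 1 = fake, 0 = real,
    and a higher score means the clip is more likely fake.
    Both classes must be present in `labels`.
    """
    labels = np.asarray(labels, dtype=int)
    scores = np.asarray(scores, dtype=float)

    fake = labels == 1
    real = ~fake

    # Candidate thresholds: every observed score, plus +inf so that
    # "flag nothing as fake" is also considered.
    thresholds = np.concatenate([np.unique(scores), [np.inf]])

    # A clip is flagged as fake when score >= threshold.
    # False acceptance rate: real clips wrongly flagged as fake.
    far = np.array([(scores[real] >= t).mean() for t in thresholds])
    # False rejection rate: fake clips that were not flagged.
    frr = np.array([(scores[fake] < t).mean() for t in thresholds])

    # EER is taken at the threshold where the two rates are closest;
    # averaging them handles the case where they never match exactly.
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)

# Toy scores: one real clip (0.6) outscores one fake clip (0.4).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.6, 0.4, 0.9]
toy_eer = compute_eer(toy_labels, toy_scores)
print(f"EER on toy data: {toy_eer:.2%}")
```

Under this reading, lower is better: a perfectly separating detector has an EER of 0%, and a detector with no discriminative ability sits near 50%.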

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing