BanglaRobustNet: A Hybrid Denoising-Attention Architecture for Robust Bangla Speech Recognition

Md Sazzadul Islam Ridoy; Mubaswira Ibnat Zidney; Sumi Akter; Md. Aminur Rahman

arXiv:2601.17679 · cs.SD · January 27, 2026

TL;DR

BanglaRobustNet is a hybrid denoising-attention architecture that improves Bangla speech recognition accuracy under noisy and speaker-diverse conditions by combining noise suppression with speaker context modeling.

Contribution

The paper introduces a diffusion-based denoising module combined with speaker-conditioned cross-attention, tailored to robust Bangla ASR in noisy, speaker-diverse environments.
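
To make the speaker-conditioning idea concrete, here is a minimal sketch of scaled dot-product cross-attention in which acoustic frames act as queries and a small set of speaker embedding vectors act as keys and values. It is an illustration only, not the paper's implementation: the function names, the residual connection, and the omission of learned query/key/value projections are all assumptions made to keep the example short.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def speaker_cross_attention(frames, speaker_tokens):
    """Condition each acoustic frame on speaker embeddings (hypothetical helper).

    frames:         list of d-dimensional frame vectors (queries)
    speaker_tokens: list of d-dimensional speaker vectors (keys and values)

    Assumption: identity projections stand in for the learned W_q, W_k, W_v
    matrices a real model would use.
    """
    dim = len(frames[0])
    conditioned = []
    for query in frames:
        # Scaled dot-product scores between this frame and every speaker token.
        scores = [dot(query, key) / math.sqrt(dim) for key in speaker_tokens]
        weights = softmax(scores)
        # Weighted sum of the speaker tokens gives the speaker context vector.
        context = [
            sum(w * token[c] for w, token in zip(weights, speaker_tokens))
            for c in range(dim)
        ]
        # Residual connection (an assumption): frame plus speaker context.
        conditioned.append([q + c for q, c in zip(query, context)])
    return conditioned
```

With a single speaker token the attention weight is trivially 1, so each frame is simply shifted by that embedding; with several tokens, each frame draws on whichever speaker vectors it aligns with most.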

Findings

01 Reduces word error rate (WER) and character error rate (CER) compared to Wav2Vec-BERT and Whisper baselines.

02 Remains effective under noisy, speaker-diverse conditions on the Mozilla Common Voice Bangla dataset.

03 Establishes a new benchmark for robust Bangla speech recognition.
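
For readers unfamiliar with the two metrics named above, the following sketch shows how WER and CER are conventionally computed from an edit distance. It is not the authors' evaluation code. Two choices here are assumptions: CER ignores whitespace, and characters are counted as Unicode code points, which means Bangla vowel signs count as separate characters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and deletions
    needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over code points, whitespace removed (assumption)."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # Illustrative transcript pair, not taken from the paper.
    reference = "আমি ভাত খাই"
    hypothesis = "আমি খাই"
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

Both rates can exceed 1.0 when the hypothesis contains many insertions, since the denominator is the reference length only.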

Abstract

Bangla, one of the most widely spoken languages, remains underrepresented in state-of-the-art automatic speech recognition (ASR) research, particularly under noisy and speaker-diverse conditions. This paper presents BanglaRobustNet, a hybrid denoising-attention framework built on Wav2Vec-BERT, designed to address these challenges. The architecture integrates a diffusion-based denoising module to suppress environmental noise while preserving Bangla-specific phonetic cues, and a contextual cross-attention module that conditions recognition on speaker embeddings for robustness across gender, age, and dialects. Trained end-to-end with a composite objective combining CTC loss, phonetic consistency, and speaker alignment, BanglaRobustNet achieves substantial reductions in word error rate (WER) and character error rate (CER) compared to Wav2Vec-BERT and Whisper baselines. Evaluations on…
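
The abstract names three terms in the training objective but, in the portion shown here, not how they are weighted. One common way to write such a composite loss is a weighted sum. The weights $\lambda_{\text{ph}}$ and $\lambda_{\text{spk}}$ and the subscript names below are assumptions for illustration, not values or notation from the paper:

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{CTC}}
  + \lambda_{\text{ph}} \, \mathcal{L}_{\text{phonetic}}
  + \lambda_{\text{spk}} \, \mathcal{L}_{\text{speaker}},
\qquad \lambda_{\text{ph}},\ \lambda_{\text{spk}} \ge 0
```

Under this form, setting both weights to zero recovers plain CTC training, which gives a natural ablation baseline for the two auxiliary terms.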

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders