Multi-Level Embedding Conformer Framework for Bengali Automatic Speech Recognition

Md. Nazmus Sakib; Golam Mahmud; Md. Maruf Bangabashi; Umme Ara Mahinur Istia; Md. Jahidul Islam; Partha Sarker; Afra Yeamini Prity

arXiv:2601.09710 · eess.AS · January 16, 2026

TL;DR

This paper introduces a multi-level embedding Conformer framework for Bengali ASR, integrating phoneme, syllable, and wordpiece information to improve recognition accuracy in a low-resource language.

Contribution

It proposes a novel multi-level embedding fusion mechanism within a Conformer-CTC model specifically designed for Bengali ASR, enhancing phonetic and contextual feature capture.
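As a rough illustration of what fusing multi-level embeddings with acoustic features could look like, the sketch below concatenates frame-level acoustic vectors with phoneme, syllable, and wordpiece embeddings and projects the result back to the model width. Everything here is an assumption made for illustration: the class name `MultiLevelFusion`, the concatenate-and-project design, the embedding size, and the premise that each frame already has a unit ID at every level. The summary does not say how the paper obtains, aligns, or combines these embeddings.

```python
import numpy as np


class MultiLevelFusion:
    """Hypothetical fusion layer: concatenate acoustic frames with
    per-level linguistic embeddings, then project to d_model."""

    def __init__(self, d_model, vocab_sizes, d_embed=32, seed=0):
        rng = np.random.default_rng(seed)
        # One randomly initialised embedding table per linguistic level
        # (in a real model these would be learned parameters).
        self.tables = {
            name: rng.normal(0.0, 0.02, size=(size, d_embed))
            for name, size in vocab_sizes.items()
        }
        d_in = d_model + d_embed * len(vocab_sizes)
        # Linear projection back to the model width.
        self.proj = rng.normal(0.0, 0.02, size=(d_in, d_model))

    def __call__(self, acoustic, unit_ids):
        # acoustic: (T, d_model) frame features.
        # unit_ids[name]: (T,) integer IDs, assumed frame-aligned.
        parts = [acoustic]
        for name, table in self.tables.items():
            parts.append(table[unit_ids[name]])  # (T, d_embed) lookup
        fused = np.concatenate(parts, axis=-1)   # (T, d_in)
        return fused @ self.proj                 # (T, d_model)


# Illustrative vocabulary sizes, not taken from the paper.
fusion = MultiLevelFusion(
    d_model=16,
    vocab_sizes={"phoneme": 50, "syllable": 200, "wordpiece": 500},
)
T = 10
acoustic = np.ones((T, 16))
ids = {
    "phoneme": np.zeros(T, dtype=int),
    "syllable": np.zeros(T, dtype=int),
    "wordpiece": np.zeros(T, dtype=int),
}
out = fusion(acoustic, ids)
```

Concatenation followed by a projection is only one plausible choice; summation or gated fusion would fit the same description equally well.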

Findings

01. Achieved a WER of 10.01% and a CER of 5.03% on Bengali speech data.

02. Demonstrated the effectiveness of multi-granular linguistic embeddings.

03. Showed improved recognition performance over baseline models.
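For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between the reference and the hypothesis, divided by the number of reference words; character error rate applies the same idea to characters. The sketch below shows that standard definition. It is not the paper's evaluation script, and two choices are assumptions: spaces are dropped before computing CER, and characters are Unicode code points rather than grapheme clusters, which matters for Bengali text with combining signs.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over code points, ignoring spaces (assumed)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One substitution and one deletion against four reference words.
print(wer("a b c d", "a x c"))  # 0.5
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.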

Abstract

Bengali, spoken by over 300 million people, is a morphologically rich and low-resource language, posing challenges for automatic speech recognition (ASR). This research presents an end-to-end framework for Bengali ASR, building on a Conformer-CTC backbone with a multi-level embedding fusion mechanism that incorporates phoneme, syllable, and wordpiece representations. By enriching acoustic features with these linguistic embeddings, the model captures fine-grained phonetic cues and higher-level contextual patterns. The architecture employs early and late Conformer stages, with preprocessing steps including silence trimming, resampling, Log-Mel spectrogram extraction, and SpecAugment augmentation. The experimental results demonstrate the strong potential of the model, achieving a word error rate (WER) of 10.01% and a character error rate (CER) of 5.03%. These results demonstrate the…
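Of the preprocessing steps listed in the abstract, SpecAugment is the easiest to show in isolation. The sketch below applies its frequency and time masking to a Log-Mel spectrogram held as a NumPy array; the time-warping component of SpecAugment is omitted. The function name `spec_augment`, the mask counts and widths, and the choice to fill masked regions with the spectrogram mean are all illustrative assumptions, since the abstract gives no augmentation settings.

```python
import numpy as np


def spec_augment(log_mel, num_freq_masks=2, max_freq_width=8,
                 num_time_masks=2, max_time_width=20, rng=None):
    """Return a masked copy of log_mel with shape (n_mels, n_frames).

    Random frequency bands and time spans are overwritten with the
    mean of the input; the input array itself is left untouched.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = log_mel.copy()
    n_mels, n_frames = out.shape
    fill = out.mean()  # computed before any masking

    for _ in range(num_freq_masks):
        width = int(rng.integers(0, max_freq_width + 1))
        start = int(rng.integers(0, max(1, n_mels - width + 1)))
        out[start:start + width, :] = fill   # mask a band of mel bins

    for _ in range(num_time_masks):
        width = int(rng.integers(0, max_time_width + 1))
        start = int(rng.integers(0, max(1, n_frames - width + 1)))
        out[:, start:start + width] = fill   # mask a span of frames

    return out


# Stand-in for a real Log-Mel spectrogram: 80 mel bins, 200 frames.
x = np.random.default_rng(1).normal(size=(80, 200))
y = spec_augment(x, rng=np.random.default_rng(0))
```

Masking is normally applied only during training, with fresh random masks drawn for every utterance in every epoch.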

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders