State-Space Models in Efficient Whispered and Multi-dialect Speech Recognition

Aref Farhadipour; Homayoon Beigi; Volker Dellwo; Hadi Veisi

arXiv:2506.16969·eess.AS·June 30, 2025

Open Access

TL;DR

This paper combines a Mamba-based state-space model with four fine-tuned self-supervised models (Wav2Vec2, WavLM, HuBERT, and Whisper) for whispered and multi-dialect speech recognition, and reports what the authors believe is the best performance to date on the wTIMIT and CHAINS datasets.

Contribution

The paper presents a Mamba-based state-space approach, used together with fine-tuned self-supervised models, for efficient whispered and multi-dialect speech recognition. The authors report that it outperforms previously published results on wTIMIT and CHAINS.

Findings

01

Reports the best performance to date, to the authors' knowledge, on the wTIMIT and CHAINS datasets for whispered speech recognition.

02

The model works efficiently when only a small amount of whispered data is available.

03

Effective across Singaporean, US, and Irish dialects and across both whispered and normal speech.

Abstract

Whispered speech recognition presents significant challenges for conventional automatic speech recognition systems, particularly when combined with dialect variation. An efficient method that addresses this problem with a small dataset and a low processing load is therefore valuable. This paper proposes a solution using a Mamba-based state-space model and four fine-tuned self-supervised models (Wav2Vec2, WavLM, HuBERT, and Whisper) to address the dual challenges of whispered speech and dialect diversity. To the best of our knowledge, this represents the best performance reported on the wTIMIT and CHAINS datasets for whispered speech recognition. We trained the models using whispered and normal speech data across Singaporean, US, and Irish dialects. The findings demonstrated that the proposed Mamba-based model could work as a highly efficient model trained with low…
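For readers unfamiliar with the term, the sketch below illustrates the generic linear state-space recurrence that Mamba-style models build on. It is not the authors' architecture. The function name `ssm_scan` and the parameters `a`, `b`, `c` are hypothetical, chosen only for illustration, and the parameters are fixed here, whereas Mamba makes them depend on the input at each step.

```python
from typing import List


def ssm_scan(a: List[float], b: List[float], c: List[float],
             xs: List[float]) -> List[float]:
    """Run a diagonal linear state-space recurrence over a 1-D sequence.

    Illustrative only (hypothetical names, fixed parameters):
        h_t = a * h_{t-1} + b * x_t   (elementwise, one entry per state dim)
        y_t = sum(c * h_t)
    """
    h = [0.0] * len(a)  # hidden state starts at zero
    ys = []
    for x in xs:
        # Update every state dimension from the previous state and the input.
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        # Read out a scalar output as a weighted sum of the state.
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


if __name__ == "__main__":
    # Impulse input: the output decays geometrically with factor a = 0.5.
    print(ssm_scan([0.5], [1.0], [1.0], [1.0, 0.0, 0.0]))
```

Each step touches the state once, so the cost grows linearly with sequence length. This linear scaling is the usual efficiency argument for state-space models on long audio sequences.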

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Authorship Attribution and Profiling