TL;DR
This paper evaluates Mamba-based HuBERT models for speech self-supervised learning. They fine-tune on long-context ASR with significantly lower compute, outperform Transformer-based models when fine-tuned for streaming ASR, and perform competitively on SUPERB probing benchmarks.
Contribution
It introduces Mamba-based HuBERT models as alternatives to Transformer-based SSL architectures, using linear-time selective state spaces to cut the compute needed for long-context ASR fine-tuning and to improve streaming ASR performance.
Findings
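Mamba-based models outperform Transformer-based models when fine-tuned for streaming ASR.
They yield higher-quality quantized speech representations and capture speaker-related features more distinctly.
They are competitive on SUPERB probing benchmarks, particularly in causal settings.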
Abstract
While Mamba has demonstrated strong performance in language modeling, its potential as a speech self-supervised learning (SSL) model remains underexplored, with prior studies limited to isolated tasks. To address this, we explore Mamba-based HuBERT models as alternatives to Transformer-based SSL architectures. Leveraging the linear-time Selective State Space, these models enable fine-tuning on long-context ASR with significantly lower compute. Moreover, they show superior performance when fine-tuned for streaming ASR. Beyond fine-tuning, these models show competitive performance on SUPERB probing benchmarks, particularly in causal settings. Our analysis shows that they yield higher-quality quantized representations and capture speaker-related features more distinctly than Transformer-based models. These findings highlight Mamba-based SSL as a promising and complementary direction for…
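The linear-time property mentioned in the abstract can be illustrated with a toy recurrence. The sketch below is not the authors' implementation: it is a simplified, single-channel version of a selective state-space scan with a diagonal state matrix, and every name in it (`selective_scan`, `w_delta`, `w_b`, `w_c`) is hypothetical. It assumes the step size and the input and output projections are simple scalar functions of the current input, which is what makes the recurrence "selective". Each time step does a fixed amount of work, so the cost grows linearly with sequence length, and each output depends only on past inputs, which suits causal and streaming use.

```python
import math


def selective_scan(xs, a, w_delta, w_b, w_c):
    """Toy single-channel selective state-space recurrence (illustrative only).

    xs:      input sequence (list of floats)
    a:       diagonal of the state matrix, negative floats, length N
    w_delta: hypothetical scalar weight producing the input-dependent step size
    w_b:     hypothetical weights producing the input-dependent B_t, length N
    w_c:     hypothetical weights producing the input-dependent C_t, length N
    """
    n = len(a)
    h = [0.0] * n          # hidden state, carried across time steps
    ys = []
    for x in xs:           # one pass over the sequence: O(len(xs) * N)
        # Softplus keeps the input-dependent step size positive.
        delta = math.log1p(math.exp(w_delta * x))
        y = 0.0
        for i in range(n):
            a_bar = math.exp(delta * a[i])   # discretized state decay
            b_t = w_b[i] * x                 # input-dependent B_t
            c_t = w_c[i] * x                 # input-dependent C_t
            h[i] = a_bar * h[i] + delta * b_t * x
            y += c_t * h[i]
        ys.append(y)
    return ys


if __name__ == "__main__":
    params = dict(a=[-1.0, -0.5], w_delta=0.3, w_b=[0.5, -0.2], w_c=[1.0, 0.7])
    print(selective_scan([0.1, 0.4, -0.2, 0.9], **params))
```

Because the state `h` has a fixed size, processing one more frame costs the same regardless of how much audio came before, in contrast to self-attention, whose per-frame cost grows with context length.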
