Mamba for Streaming ASR Combined with Unimodal Aggregation
Ying Fang, Xiaofei Li

TL;DR
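This paper explores Mamba, a recently proposed state space model with linear complexity, as the encoder for streaming ASR. It adds a lookahead mechanism for controlled use of future context, a streaming unimodal aggregation (UMA) method, and a UMA-based early termination method that reduces recognition latency.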
Contribution
It proposes a lookahead mechanism for a Mamba encoder, a streaming-style unimodal aggregation (UMA) method, and a UMA-based early termination (ET) method for streaming ASR.
Findings
Achieves competitive ASR performance on two Mandarin Chinese datasets.
Reduces recognition latency through early termination.
Demonstrates the efficiency of a Mamba encoder for streaming ASR.
Abstract
This paper works on streaming automatic speech recognition (ASR). Mamba, a recently proposed state space model, has demonstrated the ability to match or surpass Transformers in various tasks while benefiting from a linear complexity advantage. We explore the efficiency of Mamba encoder for streaming ASR and propose an associated lookahead mechanism for leveraging controllable future information. Additionally, a streaming-style unimodal aggregation (UMA) method is implemented, which automatically detects token activity and streamingly triggers token output, and meanwhile aggregates feature frames for better learning token representation. Based on UMA, an early termination (ET) method is proposed to further reduce recognition latency. Experiments conducted on two Mandarin Chinese datasets demonstrate that the proposed model achieves competitive ASR performance in terms of both recognition…
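The abstract describes streaming UMA only at a high level, so the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes each encoder frame comes with a scalar aggregation weight, that a token segment ends when the weights start rising again after having fallen (a valley), and that the frames of a segment are combined by a weighted average. The names `StreamingUMA`, `push`, `flush`, and `aggregate_stream` are hypothetical.

```python
from typing import List, Optional, Sequence


class StreamingUMA:
    """Toy streaming aggregator (hypothetical; not the authors' code)."""

    def __init__(self) -> None:
        self.prev_w: Optional[float] = None  # weight of the previous frame
        self.falling = False                 # weights have started to fall in this segment
        self.num: List[float] = []           # running weighted sum of frames
        self.den = 0.0                       # running sum of weights

    def push(self, frame: Sequence[float], weight: float) -> Optional[List[float]]:
        """Feed one frame; return an aggregated frame if a segment just closed."""
        out = None
        if self.prev_w is not None:
            if weight < self.prev_w:
                self.falling = True
            elif weight > self.prev_w and self.falling:
                # Assumption: the previous frame was a valley, so the
                # rise-then-fall segment is complete and can be emitted.
                out = self.flush()
        if not self.num:
            self.num = [0.0] * len(frame)
        for i, x in enumerate(frame):
            self.num[i] += weight * x
        self.den += weight
        self.prev_w = weight
        return out

    def flush(self) -> Optional[List[float]]:
        """Close the current segment; also call once at the end of the utterance."""
        if self.den == 0.0:
            self.num, self.falling = [], False
            return None
        agg = [x / self.den for x in self.num]  # weighted average of the segment
        self.num, self.den, self.falling = [], 0.0, False
        return agg


def aggregate_stream(frames: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[List[float]]:
    """Run the toy aggregator over a whole sequence, frame by frame."""
    uma = StreamingUMA()
    outputs: List[List[float]] = []
    for frame, w in zip(frames, weights):
        seg = uma.push(frame, w)
        if seg is not None:
            outputs.append(seg)
    last = uma.flush()
    if last is not None:
        outputs.append(last)
    return outputs


# Five 1-dimensional frames whose weights rise, fall, then rise and fall again.
segments = aggregate_stream([[1.0], [2.0], [3.0], [4.0], [5.0]],
                            [0.1, 0.8, 0.2, 0.9, 0.3])
print(len(segments), segments)
```

In this sketch a segment is emitted as soon as the next frame shows a rising weight, so the output trails the input by one frame. How the paper computes the weights and where exactly it places segment boundaries is not stated in the abstract.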
Taxonomy
Topics: IoT-based Smart Home Systems
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
