MFLA: Monotonic Finite Look-ahead Attention for Streaming Speech Recognition

Yinfeng Xia; Huiyan Li; Chenyang Le; Manhong Wang; Yutao Sun; Xingyang Ma; Yanmin Qian

arXiv:2506.03722 · cs.CL · June 5, 2025

Open Access

TL;DR

This paper introduces MFLA, a novel attention mechanism for streaming speech recognition that balances latency and accuracy by combining monotonic alignment with finite look-ahead attention, enabling efficient real-time transcription.

Contribution

It proposes Monotonic Finite Look-ahead Attention and a prefix-to-prefix training framework to improve streaming speech recognition with large pre-trained models like Whisper.

Findings

01. Achieves a controllable trade-off between latency and recognition quality.

02. Demonstrates effective quasi-monotonic alignment between speech and text.

03. Simplifies decoding with the wait-k strategy while maintaining accuracy.
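
As a rough illustration of the wait-k idea mentioned in the findings, the sketch below computes how many source units are visible when each output token is emitted. It uses the textbook wait-k schedule from simultaneous translation, g(t) = min(k + t - 1, |x|); treating a "source unit" as one aligned speech segment is an assumption here, and the function name `wait_k_schedule` is hypothetical rather than taken from the paper.

```python
from typing import List


def wait_k_schedule(num_source_units: int, num_target_tokens: int, k: int) -> List[int]:
    """Return, for each target step t = 1..T, how many source units have been read.

    Standard wait-k rule (an assumption about the paper's exact variant):
    wait for k source units, then alternate one read with one write
    until the source is exhausted.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    return [min(k + t - 1, num_source_units) for t in range(1, num_target_tokens + 1)]


if __name__ == "__main__":
    # With k = 2, the first token is emitted after 2 source units have arrived.
    print(wait_k_schedule(num_source_units=6, num_target_tokens=5, k=2))
```

A larger k delays the first token but gives every token more right context, which is one simple way to read the latency/quality trade-off described above.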

Abstract

Applying large pre-trained speech models like Whisper has shown promise in reducing training costs for various speech tasks. However, integrating these models into streaming systems remains a challenge. This paper presents a novel prefix-to-prefix training framework for streaming recognition by fine-tuning Whisper. We introduce the Continuous Integrate-and-Fire mechanism to establish a quasi-monotonic alignment between continuous speech sequences and discrete text tokens. Additionally, we design Monotonic Finite Look-ahead Attention, which allows each token to attend to an infinite left context and a finite right context of the speech sequence. We also employ the wait-k decoding strategy to simplify the decoding process while ensuring consistency between training and testing. Our theoretical analysis and experiments demonstrate that this approach achieves a controllable trade-off between…
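
The two mechanisms named in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the per-frame weights `alphas` are assumed to lie in [0, 1], the firing threshold is assumed to be 1.0, and the look-ahead is assumed to be counted in frames past the frame where a token fires. The function names `cif_fire_frames` and `finite_lookahead_mask` are hypothetical.

```python
from typing import List


def cif_fire_frames(alphas: List[float], threshold: float = 1.0) -> List[int]:
    """Simplified integrate-and-fire: accumulate per-frame weights and record
    the frame index at which each token boundary fires.

    Assumes each weight is in [0, 1], so at most one boundary fires per frame.
    """
    fired = []
    acc = 0.0
    for t, a in enumerate(alphas):
        acc += a
        if acc >= threshold:
            fired.append(t)
            acc -= threshold  # carry the remainder into the next token
    return fired


def finite_lookahead_mask(fire_frames: List[int], num_frames: int, lookahead: int) -> List[List[bool]]:
    """Row i is True for every frame token i may attend to: all frames to the
    left of its firing frame plus `lookahead` frames to the right (assumed unit)."""
    mask = []
    for f in fire_frames:
        last = min(f + lookahead, num_frames - 1)
        mask.append([t <= last for t in range(num_frames)])
    return mask


if __name__ == "__main__":
    alphas = [0.5, 0.5, 0.25, 0.25, 0.5, 0.75, 0.25]  # illustrative weights
    fires = cif_fire_frames(alphas)
    mask = finite_lookahead_mask(fires, num_frames=len(alphas), lookahead=1)
    for row in mask:
        print("".join("1" if v else "0" for v in row))
```

Because the firing frames are non-decreasing, each row of the mask extends at least as far as the row before it, which is what makes the attention pattern monotonic. Increasing `lookahead` widens the right context at the cost of waiting for more audio.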

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis

Methods: Softmax · Attention Is All You Need