Monotonic segmental attention for automatic speech recognition
Albert Zeyer, Robin Schmitt, Wei Zhou, Ralf Schlüter, Hermann Ney

TL;DR
This paper presents a segmental-attention model for automatic speech recognition that restricts decoder attention to segments, which avoids the quadratic runtime of global attention, generalizes better to long sequences, and is a step towards streaming.
Contribution
It introduces a segmental-attention mechanism and two time-synchronous decoders, one of which explicitly takes the segmental structure into account. The best segmental model with segmental decoding outperforms the global-attention model.
Findings
The best segmental-attention model, using segmental decoding, performs better than global attention.
The segmental model generalizes better to long sequences.
Time-synchronous decoding for segmental models is a step towards streaming applications.
Abstract
We introduce a novel segmental-attention model for automatic speech recognition. We restrict the decoder attention to segments to avoid quadratic runtime of global attention, better generalize to long sequences, and eventually enable streaming. We directly compare global-attention and different segmental-attention modeling variants. We develop and compare two separate time-synchronous decoders, one specifically taking the segmental nature into account, yielding further improvements. Using time-synchronous decoding for segmental models is novel and a step towards streaming applications. Our experiments show the importance of a length model to predict the segment boundaries. The final best segmental-attention model using segmental decoding performs better than global-attention, in contrast to other monotonic attention approaches in the literature. Further, we observe that the segmental…
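To make the idea of restricting decoder attention to segments concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes dot-product scoring, contiguous non-overlapping segments, and segment boundaries that are already given (in the paper they are predicted by a length model). The names `attend`, `segmental_attention`, and `boundaries` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, encoder, start, end):
    """Dot-product attention restricted to encoder[start:end] (assumed scoring).

    Returns the context vector and the attention weights padded with zeros
    to the full encoder length, so they can be compared to global attention.
    """
    if not 0 <= start < end <= len(encoder):
        raise ValueError("segment must be non-empty and inside the encoder sequence")
    frames = encoder[start:end]
    weights = softmax([dot(query, f) for f in frames])
    context = [sum(w * f[d] for w, f in zip(weights, frames))
               for d in range(len(query))]
    full = [0.0] * len(encoder)
    full[start:end] = weights
    return context, full


def segmental_attention(queries, encoder, boundaries):
    """One query per output label; boundaries[i] is the exclusive end frame
    of segment i. Segments are contiguous, so attention moves monotonically."""
    outputs = []
    start = 0
    for query, end in zip(queries, boundaries):
        outputs.append(attend(query, encoder, start, end))
        start = end  # the next segment begins where this one ended
    return outputs


if __name__ == "__main__":
    # Toy data: 6 encoder frames of dimension 2, 2 output labels.
    encoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
               [0.5, 0.5], [2.0, 0.0], [0.0, 2.0]]
    queries = [[1.0, 0.0], [0.0, 1.0]]
    for context, weights in segmental_attention(queries, encoder, [2, 6]):
        print([round(w, 3) for w in weights])
```

In this sketch, global attention is the special case `attend(query, encoder, 0, len(encoder))` for every label, so each output step touches all frames. With segments, each encoder frame is scored exactly once across all output steps, which is where the runtime saving in the abstract comes from.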
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
