Monotonic segmental attention for automatic speech recognition

Albert Zeyer; Robin Schmitt; Wei Zhou; Ralf Schlüter; Hermann Ney

arXiv:2210.14742·cs.CL·October 27, 2022

TL;DR

This paper presents a new segmental-attention model for automatic speech recognition that improves efficiency, generalizes better to long sequences, and enables streaming by restricting attention to segments.

Contribution

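The paper introduces a segmental-attention model and two time-synchronous decoders, one of which explicitly accounts for the segmental structure. The best segmental model with segmental decoding outperforms the global-attention baseline and is a step towards streaming ASR.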

Findings

01 The best segmental-attention model, decoded with the segmental decoder, outperforms the global-attention model.

02 The segmental model generalizes better to long sequences.

03 Time-synchronous decoding of segmental models is a step towards streaming applications.

Abstract

We introduce a novel segmental-attention model for automatic speech recognition. We restrict the decoder attention to segments to avoid quadratic runtime of global attention, better generalize to long sequences, and eventually enable streaming. We directly compare global-attention and different segmental-attention modeling variants. We develop and compare two separate time-synchronous decoders, one specifically taking the segmental nature into account, yielding further improvements. Using time-synchronous decoding for segmental models is novel and a step towards streaming applications. Our experiments show the importance of a length model to predict the segment boundaries. The final best segmental-attention model using segmental decoding performs better than global-attention, in contrast to other monotonic attention approaches in the literature. Further, we observe that the segmental…
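To make the idea of restricting decoder attention to segments concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`attend`, `segmental_contexts`), the use of plain dot-product scoring, and the assumption that segments are contiguous, non-overlapping, and given as exclusive end indices are all illustrative choices. In the paper the boundaries come from a length model; here they are simply passed in.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, enc, start, end):
    """Dot-product attention restricted to encoder frames enc[start:end].

    Returns (context, weights). `weights` has one entry per encoder frame
    and is exactly zero outside the segment. Passing start=0, end=len(enc)
    recovers global attention over all frames.
    """
    if not 0 <= start < end <= len(enc):
        raise ValueError("segment must be a non-empty range inside the encoder output")
    seg = enc[start:end]
    seg_weights = softmax([dot(query, h) for h in seg])
    dim = len(enc[0])
    context = [sum(w * h[d] for w, h in zip(seg_weights, seg)) for d in range(dim)]
    weights = [0.0] * start + seg_weights + [0.0] * (len(enc) - end)
    return context, weights


def segmental_contexts(queries, enc, boundaries):
    """Compute one context vector per output label.

    `boundaries` holds hypothetical, monotonically increasing segment end
    indices (exclusive), one per label. Each label only scores the frames
    of its own segment, so the total work is linear in the number of frames
    rather than (labels x frames) as with global attention.
    """
    contexts = []
    start = 0
    for q, end in zip(queries, boundaries):
        ctx, _ = attend(q, enc, start, end)
        contexts.append(ctx)
        start = end  # the next segment begins where this one ended
    return contexts
```

For example, with four encoder frames and two labels, `segmental_contexts(queries, enc, [2, 4])` lets the first label attend to frames 0 and 1 and the second to frames 2 and 3. Because each step only needs frames up to the current segment end, the same loop could in principle run while audio is still arriving, which is the streaming motivation stated in the abstract.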

Code & Models

Repositories

rwth-i6/returnn-experiments (Official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing