XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models

Shashi Kumar; Srikanth Madikeri; Juan Zuluaga-Gomez; Esaú Villatoro-Tello; Iuliia Thorbecke; Petr Motlicek; Manjunath K E; Aravind Ganapathiraju

arXiv:2407.04439 · eess.AS · October 10, 2024 · ICASSP · 1 cites

Open Access

TL;DR

This paper presents XLSR-Transducer, a streaming automatic speech recognition model that uses the self-supervised pretrained XLSR-53 model as its encoder. Attention masking enables streaming, and attention sinks reduce the required left context while improving WER. The model is validated on AMI and CommonVoice.

Contribution

It introduces XLSR-Transducer, which enables streaming ASR with a self-supervised pretrained model by applying attention masking in the XLSR-53 encoder, and uses attention sinks to improve WER while reducing the required left context.

Findings

01

On AMI, XLSR-Transducer improves WER by 4% absolute over Whisper large-v2 and by 8% absolute over a Zipformer transducer trained from scratch.

02

Attention sinks halve the required left context while giving a relative 12% WER improvement.

03

Validated on AMI and five languages from CommonVoice in low-resource scenarios.
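
The findings mix absolute and relative WER improvements, which are easy to confuse. The sketch below shows how the two differ. The function names and the WER values in the usage lines are hypothetical illustrations, not numbers reported in the paper.

```python
def absolute_improvement(baseline_wer, new_wer):
    """Difference in WER, in percentage points."""
    return baseline_wer - new_wer


def relative_improvement(baseline_wer, new_wer):
    """WER reduction as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values for illustration only (not taken from the paper).
baseline, improved = 25.0, 22.0
print(absolute_improvement(baseline, improved))  # percentage points
print(relative_improvement(baseline, improved))  # fraction of the baseline
```

With these made-up values, a drop of 3 percentage points corresponds to a 12% relative improvement, so the same change looks much larger when stated relatively.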

Abstract

Self-supervised pretrained models exhibit competitive performance in automatic speech recognition on finetuning, even with limited in-domain supervised data. However, popular pretrained models are not suitable for streaming ASR because they are trained with full attention context. In this paper, we introduce XLSR-Transducer, where the XLSR-53 model is used as encoder in transducer setup. Our experiments on the AMI dataset reveal that the XLSR-Transducer achieves 4% absolute WER improvement over Whisper large-v2 and 8% over a Zipformer transducer model trained from scratch. To enable streaming capabilities, we investigate different attention masking patterns in the self-attention computation of transformer layers within the XLSR-53 model. We validate XLSR-Transducer on AMI and 5 languages from CommonVoice under low-resource scenarios. Finally, with the introduction of attention sinks, we…
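
The masking idea in the abstract can be illustrated with a chunked attention mask. The following is a minimal sketch, not the authors' implementation: it assumes frames are grouped into fixed-size chunks, each query sees its own chunk plus a fixed number of previous chunks, and attention sinks are the first few frames of the utterance that stay visible to later queries. The function name and all parameter names are hypothetical.

```python
def chunked_attention_mask(num_frames, chunk_size, left_chunks, num_sink_frames=0):
    """Build a num_frames x num_frames boolean mask.

    mask[q][k] is True when query frame q may attend to key frame k.
    All parameter names are illustrative assumptions, not the paper's API.
    """
    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        row = []
        for k in range(num_frames):
            k_chunk = k // chunk_size
            # Visible window: the current chunk plus `left_chunks` past chunks,
            # and never a future chunk (this is what makes it streamable).
            in_window = q_chunk - left_chunks <= k_chunk <= q_chunk
            # Attention sinks: the first frames stay visible to later queries,
            # but still never leak frames from a future chunk.
            is_sink = k < num_sink_frames and k_chunk <= q_chunk
            row.append(in_window or is_sink)
        mask.append(row)
    return mask


# Usage: 8 frames, chunks of 2 frames, 1 chunk of left context, 1 sink frame.
for row in chunked_attention_mask(8, 2, 1, num_sink_frames=1):
    print("".join("1" if visible else "0" for visible in row))
```

In a real encoder such a mask would be applied to the self-attention scores of every transformer layer. Adding sink frames keeps a little global context available, so the sliding left window can be made shorter.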

Taxonomy

Topics: Fault Detection and Control Systems

Methods: Softmax · Attention Is All You Need