XLSR-Transducer: Streaming ASR for Self-Supervised Pretrained Models
Shashi Kumar, Srikanth Madikeri, Juan Zuluaga-Gomez, Esaú Villatoro-Tello, Iuliia Thorbecke, Petr Motlicek, Manjunath K E, Aravind Ganapathiraju

TL;DR
This paper presents XLSR-Transducer, a streaming automatic speech recognition (ASR) model that uses the self-supervised pretrained XLSR-53 model as the encoder of a transducer. Attention masking makes the encoder usable for streaming, attention sinks improve accuracy while reducing the left context, and the model is validated on AMI and CommonVoice.
Contribution
It introduces XLSR-Transducer, which enables streaming ASR with a self-supervised pretrained model by applying attention masking in the self-attention of the XLSR-53 transformer layers, and it adds attention sinks to improve WER while reducing the left context.
Findings
On AMI, XLSR-Transducer achieves a 4% absolute WER improvement over Whisper large-v2 and 8% over a Zipformer transducer model trained from scratch.
Attention sinks halve the left context while yielding a relative 12% improvement in WER.
Validated on AMI and five languages from CommonVoice in low-resource scenarios.
Abstract
Self-supervised pretrained models exhibit competitive performance in automatic speech recognition (ASR) after fine-tuning, even with limited in-domain supervised data. However, popular pretrained models are not suitable for streaming ASR because they are trained with full attention context. In this paper, we introduce XLSR-Transducer, where the XLSR-53 model is used as the encoder in a transducer setup. Our experiments on the AMI dataset reveal that the XLSR-Transducer achieves a 4% absolute WER improvement over Whisper large-v2 and 8% over a Zipformer transducer model trained from scratch. To enable streaming capabilities, we investigate different attention masking patterns in the self-attention computation of transformer layers within the XLSR-53 model. We validate XLSR-Transducer on AMI and 5 languages from CommonVoice under low-resource scenarios. Finally, with the introduction of attention sinks, we…
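To make the idea of attention masking with sinks concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a chunk-wise mask in which each frame attends to its own chunk plus a fixed number of preceding chunks, and it assumes the attention sinks are the first few frames of the utterance, which stay visible to every later frame. The function names and the parameters `chunk_size`, `left_chunks`, and `num_sinks` are hypothetical and chosen only for illustration; the summary above does not state the paper's actual configuration.

```python
import numpy as np


def streaming_attention_mask(num_frames, chunk_size, left_chunks, num_sinks=0):
    """Boolean (num_frames, num_frames) mask; True means query row i may attend to key column j.

    Illustrative assumptions (not taken from the paper):
      * frames are grouped into fixed-size chunks,
      * a query sees its own chunk and up to `left_chunks` chunks before it,
      * the first `num_sinks` frames act as attention sinks and remain visible.
    """
    chunk_id = np.arange(num_frames) // chunk_size
    q = chunk_id[:, None]  # chunk index of each query frame
    k = chunk_id[None, :]  # chunk index of each key frame

    # Chunked attention with limited left context and no access to future chunks.
    allowed = (k <= q) & (k >= q - left_chunks)

    # Attention sinks: the earliest frames stay visible, but never from a chunk
    # that lies before them (so the mask does not leak future frames).
    is_sink = np.arange(num_frames)[None, :] < num_sinks
    allowed |= is_sink & (k <= q)
    return allowed


def masked_attention_weights(scores, mask):
    """Row-wise softmax over `scores`, with disallowed positions set to zero weight."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


# 8 frames, chunks of 2 frames, 1 chunk of left context, 1 sink frame.
mask = streaming_attention_mask(num_frames=8, chunk_size=2, left_chunks=1, num_sinks=1)
print(mask[6].astype(int))  # frame 6 sees frames 4-7 plus sink frame 0: [1 0 0 0 1 1 1 1]
```

In this sketch, the sink frames are what allow `left_chunks` to be lowered without cutting a frame off from the start of the utterance entirely; whether that trade-off matches the paper's reported setup is not something this summary confirms.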
Taxonomy
Topics: Fault Detection and Control Systems
Methods: Softmax · Attention Is All You Need
