SURT 2.0: Advances in Transducer-based Multi-talker Speech Recognition
Desh Raj, Daniel Povey, Sanjeev Khudanpur

TL;DR
SURT 2.0 revises the original SURT model for multi-talker speech recognition to address its leakage and omission errors, its high computational cost, and its evaluation on synthetic mixtures only, achieving state-of-the-art results on multiple benchmarks with practical training methods.
Contribution
The paper presents modifications to the original SURT model, including new unmixing, encoding, and training strategies, enabling better performance and efficiency in multi-talker ASR.
Findings
Outperforms previous SURT on multiple benchmarks
Achieves low WERs on LibriCSS, AMI, and ICSI datasets
Enables training with standard academic resources
Abstract
The Streaming Unmixing and Recognition Transducer (SURT) model was proposed recently as an end-to-end approach for continuous, streaming, multi-talker automatic speech recognition (ASR). Despite impressive results on multi-turn meetings, SURT has notable limitations: (i) it suffers from leakage- and omission-related errors; (ii) it is computationally expensive, due to which it has not seen adoption in academia; and (iii) it has only been evaluated on synthetic mixtures. In this work, we propose several modifications to the original SURT which are carefully designed to fix the above limitations. In particular, we (i) change the unmixing module to a mask estimator that uses dual-path modeling, (ii) use a streaming zipformer encoder and a stateless decoder for the transducer, (iii) perform mixture simulation using force-aligned subsegments, (iv) pre-train the transducer on single-speaker data, (v)…
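To make the "mask estimator" idea in item (i) concrete, here is a minimal sketch of mask-based unmixing in general: each output channel gets a mask with values in (0, 1), and the channel's features are the element-wise product of that mask with the mixture features. Everything below is an illustrative assumption rather than the paper's implementation: the function names `estimate_masks` and `apply_masks` are hypothetical, the estimator is a random placeholder standing in for the learned dual-path network, and the choice of two output channels and a toy 2-frame, 3-bin input is arbitrary.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Squash a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def estimate_masks(mixture, num_channels, seed=0):
    """Hypothetical placeholder estimator (NOT the paper's dual-path model).

    For each output channel, score every time-frequency bin with a random
    per-bin weight and pass the score through a sigmoid, giving a mask with
    the same shape as the mixture (frames x bins).
    """
    rng = random.Random(seed)
    num_bins = len(mixture[0])
    masks = []
    for _ in range(num_channels):
        weights = [rng.uniform(-1.0, 1.0) for _ in range(num_bins)]
        mask = [[sigmoid(w * x) for w, x in zip(weights, frame)]
                for frame in mixture]
        masks.append(mask)
    return masks


def apply_masks(mixture, masks):
    """Multiply each mask element-wise with the mixture features."""
    return [
        [[m * x for m, x in zip(mask_frame, frame)]
         for mask_frame, frame in zip(mask, mixture)]
        for mask in masks
    ]


# Toy non-negative "mixture" features: 2 frames x 3 frequency bins.
mixture = [[0.5, 1.0, 2.0],
           [1.5, 0.2, 0.0]]

masks = estimate_masks(mixture, num_channels=2)  # assumed: 2 output channels
channels = apply_masks(mixture, masks)

for index, channel in enumerate(channels):
    print(f"channel {index}: {channel}")
```

Because the masks lie strictly between 0 and 1, each unmixed channel is an attenuated copy of the mixture in every bin. In a trained system the masks would be predicted by a network, and each channel would then be passed to the recognizer.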
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
