SURT 2.0: Advances in Transducer-based Multi-talker Speech Recognition

Desh Raj; Daniel Povey; Sanjeev Khudanpur

arXiv:2306.10559 · eess.AS · September 20, 2023 · 1 citation

TL;DR

SURT 2.0 revises the original SURT model for continuous, streaming, multi-talker speech recognition, targeting its leakage and omission errors and its high computational cost. The paper reports state-of-the-art results on multiple benchmarks using training methods that fit within standard academic resources.

Contribution

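The paper modifies the original SURT model in its unmixing, encoding, and training stages: a dual-path mask estimator for unmixing, a streaming zipformer encoder with a stateless decoder for the transducer, mixture simulation from force-aligned subsegments, and pre-training of the transducer on single-speaker data. Together these changes are intended to improve both accuracy and efficiency in multi-talker ASR.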

Findings

01 Outperforms previous SURT on multiple benchmarks

02 Achieves low WERs on LibriCSS, AMI, and ICSI datasets

03 Enables training with standard academic resources
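
For readers unfamiliar with the metric behind the second finding, the following is a minimal sketch of plain word error rate (WER): the word-level edit distance between a reference and a hypothesis, divided by the reference length. The function name `word_error_rate` is hypothetical, and this is the generic single-reference definition; the paper's multi-talker scoring may use a different variant that is not described on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Plain WER: (substitutions + deletions + insertions) / reference words.

    Illustrative only; not the scoring tool used in the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions.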

Abstract

The Streaming Unmixing and Recognition Transducer (SURT) model was proposed recently as an end-to-end approach for continuous, streaming, multi-talker speech recognition (ASR). Despite impressive results on multi-turn meetings, SURT has notable limitations: (i) it suffers from leakage and omission related errors; (ii) it is computationally expensive, due to which it has not seen adoption in academia; and (iii) it has only been evaluated on synthetic mixtures. In this work, we propose several modifications to the original SURT which are carefully designed to fix the above limitations. In particular, we (i) change the unmixing module to a mask estimator that uses dual-path modeling, (ii) use a streaming zipformer encoder and a stateless decoder for the transducer, (iii) perform mixture simulation using force-aligned subsegments, (iv) pre-train the transducer on single-speaker data, (v)…
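
Modification (i) in the abstract replaces the unmixing module with a mask estimator. As a rough illustration of what mask-based unmixing means in general, the sketch below multiplies one mixed feature matrix by per-channel masks to obtain one feature stream per output channel. Everything here is an assumption for illustration: the function name `apply_masks`, the use of sigmoid masks, the two output channels, and the hand-written logits, which stand in for the output of the dual-path network that the paper actually trains.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def apply_masks(features, mask_logits):
    """Apply per-channel masks to a mixed feature matrix.

    features:    T x F nested list (frames x feature bins) of the mixture.
    mask_logits: C x T x F nested list, one logit matrix per output channel
                 (assumed here to come from some mask estimator).
    Returns a C x T x F nested list of masked features.
    """
    channels = []
    for logits in mask_logits:
        masked = [
            [sigmoid(logit) * value for logit, value in zip(logit_row, feat_row)]
            for logit_row, feat_row in zip(logits, features)
        ]
        channels.append(masked)
    return channels


if __name__ == "__main__":
    # Two frames, two feature bins (toy values).
    mixture = [[1.0, 2.0], [3.0, 4.0]]
    # Placeholder logits: channel 0 keeps half of everything,
    # channel 1 keeps frame 0 and suppresses frame 1.
    logits = [
        [[0.0, 0.0], [0.0, 0.0]],
        [[50.0, 50.0], [-50.0, -50.0]],
    ]
    for channel in apply_masks(mixture, logits):
        print(channel)
```

In a full system each masked stream would then be passed to the recognition encoder; that part is outside the scope of this sketch.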

Code & Models

Repositories

k2-fsa/icefall/tree/master/egs/libricss/SURT (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing