Improving Practical Aspects of End-to-End Multi-Talker Speech Recognition for Online and Offline Scenarios
Aswin Shanmugam Subramanian, Amit Das, Naoyuki Kanda, Jinyu Li, Xiaofei Wang, Yifan Gong

TL;DR
This paper improves multi-talker speech recognition by pairing a continuous speech separation (CSS) front-end with end-to-end models, and it targets both streaming and offline use through dual or cascaded-encoder models and segment-based SOT.
Contribution
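It combines a single-channel CSS front-end with E2E ASR, proposes either dual models (Conformer Transducer for streaming, Sequence-to-Sequence for offline) or a two-pass model based on cascaded encoders, and explores segment-based SOT (segSOT) for offline use.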
Findings
A CSS front-end improves ASR accuracy on highly overlapped speech by separating the speakers before recognition.
Dual models enable effective streaming and offline recognition.
Segment-based SOT enhances transcription readability.
Abstract
We extend the frameworks of Serialized Output Training (SOT) to address practical needs of both streaming and offline automatic speech recognition (ASR) applications. Our approach focuses on balancing latency and accuracy, catering to real-time captioning and summarization requirements. We propose several key improvements: (1) Leveraging a single-channel Continuous Speech Separation (CSS) front-end with end-to-end (E2E) systems for highly overlapping scenarios, challenging the conventional wisdom of E2E versus cascaded setups. The CSS framework improves the accuracy of the ASR system by separating overlapped speech from multiple speakers. (2) Implementing dual models -- Conformer Transducer for streaming and Sequence-to-Sequence for offline -- or alternatively, a two-pass model based on cascaded encoders. (3) Exploring segment-based SOT (segSOT) which is better suited for offline…
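For readers unfamiliar with SOT, the following is a minimal sketch of how a serialized training target is commonly built in utterance-level SOT: the transcripts of overlapping speakers are ordered by start time and joined with a speaker-change symbol. It illustrates the general idea only and is not the segSOT variant this paper proposes; the `Utterance` class, the `build_sot_target` helper, and the `<sc>` token spelling are assumptions made for illustration.

```python
from dataclasses import dataclass

# Speaker-change symbol; the exact spelling is an assumption for this sketch.
SC_TOKEN = "<sc>"


@dataclass
class Utterance:
    speaker: str   # hypothetical speaker label, unused in the target itself
    start: float   # start time in seconds
    text: str      # reference transcript


def build_sot_target(utterances, sc_token=SC_TOKEN):
    """Serialize utterances into one target string, ordered by start time."""
    ordered = sorted(utterances, key=lambda u: u.start)
    return f" {sc_token} ".join(u.text for u in ordered)


# Two partially overlapping utterances, given out of order on purpose.
example = [
    Utterance("B", 1.2, "i am fine thanks"),
    Utterance("A", 0.0, "hello how are you"),
]
sot_target = build_sot_target(example)
print(sot_target)
```

Ordering by start time gives the model a single, deterministic reference sequence for a mixture, so no permutation search over speakers is needed during training.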
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
