Endpoint Detection for Streaming End-to-End Multi-talker ASR
Liang Lu, Jinyu Li, Yifan Gong

TL;DR
This paper enhances streaming multi-talker speech recognition by integrating endpoint detection into the SURT model using an end-of-sentence token and latency penalties, achieving accurate detection with minimal accuracy loss.
Contribution
It introduces endpoint detection into the SURT framework with an end-of-sentence token and latency penalty, a novel approach for multi-talker end-to-end models.
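As a rough illustration of the end-of-sentence token idea, the sketch below appends a token to each reference transcript and reads endpoints off the decoded token stream of one output channel. It is a minimal sketch under stated assumptions, not the authors' implementation: the token name `<eos>`, the `(frame, token)` emission format, the helper names, and the 40 ms frame duration are all hypothetical. The latency penalty is a training-time term whose exact form is not given in this summary, so it is not modeled here.

```python
from typing import List, Tuple

EOS = "<eos>"  # hypothetical name for the end-of-sentence token


def append_eos(transcript: List[str]) -> List[str]:
    """Training side: add the EOS token to the end of a reference transcript."""
    return transcript + [EOS]


def detect_endpoints(emissions: List[Tuple[int, str]]) -> List[int]:
    """Decoding side: return the frame indices at which EOS was emitted
    on a single output channel. `emissions` is a list of (frame, token)."""
    return [frame for frame, token in emissions if token == EOS]


def endpoint_latency_ms(detected_frame: int,
                        reference_end_frame: int,
                        frame_ms: float = 40.0) -> float:
    """Delay between the true end of speech and the EOS emission.
    The 40 ms frame duration is an assumed value for illustration."""
    return (detected_frame - reference_end_frame) * frame_ms


# Example: one channel emits two words and then EOS at frame 12,
# while the speaker actually stopped at frame 9.
channel = [(3, "hello"), (7, "world"), (12, EOS)]
endpoints = detect_endpoints(channel)
latency = endpoint_latency_ms(endpoints[0], reference_end_frame=9)
```

In a multi-talker setting the same check would be run independently on each output channel, so that every overlapping utterance gets its own endpoint.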
Findings
Endpoint detection is achieved without significant loss of recognition accuracy.
The latency penalty substantially reduces endpoint detection delay.
The model performs well on the LibriSpeechMix dataset.
Abstract
Streaming end-to-end multi-talker speech recognition aims at transcribing the overlapped speech from conversations or meetings with an all-neural model in a streaming fashion, which is fundamentally different from a modular-based approach that usually cascades the speech separation and the speech recognition models trained independently. Previously, we proposed the Streaming Unmixing and Recognition Transducer (SURT) model based on recurrent neural network transducer (RNN-T) for this problem and presented promising results. However, for real applications, the speech recognition system is also required to determine the timestamp when a speaker finishes speaking for prompt system response. This problem, known as endpoint (EP) detection, has not been studied previously for multi-talker end-to-end models. In this work, we address the EP detection problem in the SURT framework by introducing…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
