Improving endpoint detection in end-to-end streaming ASR for conversational speech

Anandh C; Karthik Pandia Durai; Jeena Prakash; Manickavela Arumugam; Kadri Hacioglu; S.Pavankumar Dubagunta; Andreas Stolcke; Shankar Venkatesan; Aravind Ganapathiraju

arXiv:2505.17070·cs.CL·May 26, 2025

TL;DR

This paper proposes methods to improve endpoint detection in streaming transducer-based end-to-end ASR by reducing emission delay and endpointing errors, with the aim of a better user experience in conversational speech applications.

Contribution

It introduces an end-of-word token, combined with a delay penalty, to address delayed emission in transducer-based ASR, and an auxiliary network that provides reliable frame-level speech activity detection to address endpointing delay.

Findings

01. Reduced endpoint detection delay in experiments.

02. Improved accuracy of speech activity detection.

03. Enhanced user experience in conversational ASR.

Abstract

ASR endpointing (EP) plays a major role in delivering a good user experience in products supporting human or artificial agents in human-human and human-machine conversations. Transducer-based ASR (T-ASR) is an end-to-end (E2E) ASR modelling technique preferred for streaming. A major limitation of T-ASR is delayed emission of ASR outputs, which can lead to errors or delays in EP. Inaccurate EP cuts the user off while speaking and returns an incomplete transcript, while delayed EP increases the perceived latency; both degrade the user experience. We propose methods to improve EP by addressing delayed emission along with EP mistakes. To address the delayed emission problem, we introduce an end-of-word token at the end of each word, along with a delay penalty. The EP delay is addressed by obtaining reliable frame-level speech activity detection using an auxiliary network. We apply the proposed…
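The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the token string `<eow>`, the function names, the 0.5 threshold, and the trailing-silence rule are all assumptions made for illustration. The sketch works on whole words rather than subword units, and it does not model the delay penalty, which is a training-time term.

```python
from typing import List, Optional

EOW = "<eow>"  # hypothetical token string; the paper's actual symbol is not given here


def add_end_of_word_tokens(transcript: str, eow: str = EOW) -> List[str]:
    """Append an end-of-word marker after every whitespace-separated word."""
    tokens: List[str] = []
    for word in transcript.split():
        tokens.extend([word, eow])
    return tokens


def find_endpoint(speech_probs: List[float],
                  threshold: float = 0.5,
                  min_trailing_silence: int = 3) -> Optional[int]:
    """Return the frame index at which an endpoint is declared, or None.

    Assumed rule (not taken from the paper): declare an endpoint once speech
    has been observed and `min_trailing_silence` consecutive frames then have
    a speech probability below `threshold`.
    """
    seen_speech = False
    silence_run = 0
    for i, p in enumerate(speech_probs):
        if p >= threshold:
            seen_speech = True
            silence_run = 0  # speech resumes, so reset the silence counter
        elif seen_speech:
            silence_run += 1
            if silence_run >= min_trailing_silence:
                return i
    return None


if __name__ == "__main__":
    print(add_end_of_word_tokens("turn it off"))
    print(find_endpoint([0.1, 0.9, 0.8, 0.2, 0.1, 0.1, 0.9]))
```

In this sketch a larger `min_trailing_silence` makes cut-offs during mid-utterance pauses less likely but adds latency, which is the trade-off the abstract describes between inaccurate and delayed endpointing.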
