Streaming Endpointer for Spoken Dialogue using Neural Audio Codecs and Label-Delayed Training

Sathvik Udupa; Shinji Watanabe; Petr Schwarz; Jan Cernocky

arXiv:2506.07081 · cs.SD · June 23, 2025

TL;DR

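This paper presents a real-time speech endpointing method for multi-turn spoken dialogue systems that combines streaming neural audio codec (NAC) features with label delay training. At a fixed median latency of 160 ms, it reduces cutoff errors by 42.7% (single-stream) and 37.5% (two-stream) relative to baseline methods.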

Contribution

The paper introduces a label delay training scheme and combines it with streaming, low-bitrate neural audio codec features for low-latency endpointing in multi-turn dialogues.

Findings

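01 42.7% relative reduction in cutoff errors for the single-stream endpointer, at a fixed median latency of 160 ms

02 37.5% relative reduction in cutoff errors for the two-stream endpointer, at a fixed median latency of 160 ms

03 1200 ms improvement in median response time, and a 35% reduction in cutoff error, when integrated with a codec-based pretrained speech large language model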

Abstract

Accurate, low-latency endpointing is crucial for effective spoken dialogue systems. While traditional endpointers often rely on spectrum-based audio features, this work proposes real-time speech endpointing for multi-turn dialogues using streaming, low-bitrate Neural Audio Codec (NAC) features, building upon recent advancements in neural audio codecs. To further reduce cutoff errors, we introduce a novel label delay training scheme. At a fixed median latency of 160 ms, our combined NAC and label delay approach achieves significant relative cutoff error reductions: 42.7% for a single-stream endpointer and 37.5% for a two-stream configuration, compared to baseline methods. Finally, we demonstrate efficient integration with a codec-based pretrained speech large language model, improving its median response time by 1200 ms and reducing its cutoff error by 35%.
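
The abstract does not spell out how the label delay scheme works. One plausible reading, offered here only as an assumption, is that frame-level end-of-turn targets are shifted later in time by a fixed number of frames, so the model sees some trailing audio before it is trained to fire. The minimal sketch below illustrates that reading; the function name `delay_labels`, the 0/1 label convention, and the delay of 2 frames are all hypothetical and not taken from the paper.

```python
def delay_labels(labels, delay_frames):
    """Shift frame-level labels later in time by `delay_frames`.

    ASSUMPTION (not from the paper): labels are one 0/1 value per audio
    frame, where 0 means "user still speaking" and 1 means "turn has ended".
    Delaying the labels moves the 0 -> 1 transition later, so a model trained
    on them is asked to fire only after seeing extra trailing context.
    """
    if delay_frames < 0:
        raise ValueError("delay_frames must be non-negative")
    n = len(labels)
    # Pad the start with "still speaking" and truncate to the original length.
    return ([0] * delay_frames + list(labels))[:n]


# Hypothetical six-frame utterance: the turn ends at frame index 3.
original = [0, 0, 0, 1, 1, 1]
delayed = delay_labels(original, delay_frames=2)
print(delayed)  # [0, 0, 0, 0, 0, 1]
```

Under this reading, a larger delay trades response latency for fewer premature cutoffs, which is consistent with the abstract reporting cutoff error at a fixed median latency.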

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Speech and Audio Processing