RelayS2S: A Dual-Path Speculative Generation for Real-Time Dialogue

Long Mai

arXiv:2603.23346·cs.AI·March 25, 2026

TL;DR

RelayS2S is a hybrid real-time dialogue system that combines a fast speculative response draft with a high-quality slow response, achieving low latency without sacrificing response quality.

Contribution

It introduces a dual-path architecture with a speculative fast path and a high-quality slow path, enabling real-time dialogue with minimal latency and high response quality.

Findings

01. Achieves P90 onset latency comparable to end-to-end models
02. Retains 99% of cascaded pipeline response quality
03. Scalable benefits as slow-path model size increases

Abstract

Real-time spoken dialogue systems face a fundamental tension between latency and response quality. End-to-end speech-to-speech (S2S) models respond immediately and naturally handle turn-taking, backchanneling, and interruption, but produce semantically weaker outputs. Cascaded pipelines (ASR -> LLM) deliver stronger responses at the cost of latency that grows with model size. We present RelayS2S, a hybrid architecture that runs two paths in parallel upon turn detection. The fast path -- a duplex S2S model -- speculatively drafts a short response prefix that is streamed immediately to TTS for low-latency audio onset, while continuing to monitor live audio events. The slow path -- a cascaded ASR -> LLM pipeline -- generates a higher-quality continuation conditioned on the committed prefix, producing a seamless utterance. A lightweight learned verifier gates the handoff, committing the…
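The control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every function name, the canned strings, the sleep durations, and the toy acceptance rule are hypothetical stand-ins for the duplex S2S model, the ASR -> LLM pipeline, and the learned verifier. The abstract excerpt is truncated before it says what happens when the verifier rejects a draft, so the fallback here (use the slow path alone) is an assumption.

```python
import asyncio


async def fast_path_draft(audio: str) -> str:
    """Hypothetical stand-in for the duplex S2S model: drafts a short prefix."""
    await asyncio.sleep(0.01)  # simulated low latency
    return "Sure, let me check"


async def slow_path_transcribe(audio: str) -> str:
    """Hypothetical stand-in for the ASR stage of the cascaded pipeline."""
    await asyncio.sleep(0.03)
    return f"transcript of {audio}"


async def slow_path_continue(transcript: str, committed_prefix: str) -> str:
    """Hypothetical stand-in for the LLM: continues the committed prefix."""
    await asyncio.sleep(0.02)
    if committed_prefix:
        return " that for you right away."
    # Assumption: with no committed prefix, the slow path answers on its own.
    return "I can look that up for you."


def verifier_accepts(prefix: str) -> bool:
    """Toy rule standing in for the learned verifier (assumption)."""
    return len(prefix.split()) >= 2


async def respond(audio: str) -> list[str]:
    # Both paths start in parallel once a turn is detected.
    draft_task = asyncio.create_task(fast_path_draft(audio))
    asr_task = asyncio.create_task(slow_path_transcribe(audio))

    streamed: list[str] = []
    prefix = await draft_task
    committed = prefix if verifier_accepts(prefix) else ""
    if committed:
        streamed.append(committed)  # would be streamed to TTS immediately

    # The slow path conditions its continuation on the committed prefix.
    transcript = await asr_task
    streamed.append(await slow_path_continue(transcript, committed))
    return streamed


chunks = asyncio.run(respond("user_turn.wav"))
print("".join(chunks))
```

The point of the sketch is the ordering: the prefix is available for audio onset while ASR is still running, and the continuation is generated only after the prefix has been committed, so the two pieces join into one utterance.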

Code & Models

Datasets

mailong225/speech_to_speech (dataset, 136 downloads)

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Speech Recognition and Synthesis