CleanS2S: Single-file Framework for Proactive Speech-to-Speech Interaction

Yudong Lu; Yazhe Niu; Shuai Hu; Haolin Wang

arXiv:2506.01268 · cs.AI · June 3, 2025

TL;DR

CleanS2S introduces a unified, real-time speech-to-speech interaction framework with proactive dialogue capabilities, enabling more human-like and flexible conversational AI through novel memory and action judgment modules.

Contribution

It presents a single-file, transparent framework that integrates speech recognition, large language models, and proactive response strategies for conversational AI.

Findings

01. Achieves low latency with full-duplex websocket connections.

02. Supports five human-like response strategies.

03. Provides an extensible, transparent implementation for research.

Abstract

CleanS2S is a framework for human-like speech-to-speech interaction that advances conversational AI through single-file implementation and proactive dialogue capabilities. Our system integrates automatic speech recognition, large language models, and text-to-speech synthesis into a unified pipeline with real-time interruption handling, achieving low transition latency through full-duplex websocket connections and non-blocking I/O. Beyond conventional chatbot paradigms, we pioneer a proactive interaction mechanism that combines memory systems with a Subjective Action Judgement module, enabling five human-like response strategies: interruption, refusal, deflection, silence, and standard response. The memory module dynamically aggregates historical and contextual data to inform interaction decisions. This approach breaks the rigid turn-based convention by allowing system-initiated dialog…
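The real-time interruption handling mentioned in the abstract can be illustrated with a minimal sketch. This is not the CleanS2S implementation: it assumes a single asyncio event loop, replaces the ASR and LLM stages with a placeholder echo reply, and uses a word-by-word `speak` coroutine as a stand-in for streaming text-to-speech. The names `speak` and `converse` and the `<interrupted>` marker are hypothetical and exist only for this example; no websocket transport is shown.

```python
import asyncio


async def speak(reply: str, spoken: list) -> None:
    """Stand-in for streaming TTS: emit one word per tick until cancelled."""
    for word in reply.split():
        spoken.append(word)
        # Yield to the event loop, as non-blocking audio I/O would.
        await asyncio.sleep(0.01)


async def converse(utterances: list) -> list:
    """Answer each (delay_seconds, text) utterance in arrival order.

    If a new utterance arrives while the previous reply is still being
    spoken, that reply is cancelled (a user barge-in) before the next starts.
    """
    spoken = []
    current = None  # the asyncio.Task speaking the latest reply, if any
    for delay, text in utterances:
        await asyncio.sleep(delay)  # wait for the next user utterance
        if current is not None and not current.done():
            current.cancel()  # stop the reply that is still playing
            spoken.append("<interrupted>")
        reply = f"echo {text}"  # placeholder for the ASR -> LLM stages
        current = asyncio.create_task(speak(reply, spoken))
    if current is not None:
        await current  # let the final reply finish
    return spoken


if __name__ == "__main__":
    long_text = " ".join(f"w{i}" for i in range(50))
    print(asyncio.run(converse([(0.0, long_text), (0.05, "hello")])))
```

Because the reply runs as a separate task, the loop that receives user input is never blocked by playback, and cancelling the task is enough to cut the reply off mid-sentence. A full-duplex connection would play the same role as the two concurrent coroutines here, carrying input and output at the same time.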

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis