DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining
Shangeth Rajaa

TL;DR
DualTurn is a generative speech model pretrained on dual-channel conversational audio. Pretraining learns turn-taking dynamics without labels; the model is then fine-tuned to predict turn-taking signals that map to agent actions, anticipating turn boundaries earlier and with fewer interruptions.
Contribution
It introduces a novel dual-channel generative pretraining approach that captures conversational dynamics and predicts turn-taking signals directly from audio.
Findings
Outperforms VAP on agent action prediction (wF1 0.633 vs. 0.389)
Outperforms a 3.1B audio-text model on word-level turn prediction (AUC 0.930 vs. 0.880)
Anticipates turn boundaries earlier, with fewer interruptions
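For readers unfamiliar with the wF1 metric quoted above, the following is a minimal sketch of a support-weighted F1 score, assuming wF1 denotes the usual weighted average of per-class F1. The function name and the action labels ("wait", "speak", "backchannel") are hypothetical placeholders chosen for illustration; this summary does not name the five agent actions or say how the paper computes the metric.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is >= 1 because n >= 1
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1  # weight each class by its share of examples
    return score


# Hypothetical action labels, NOT the paper's label set
y_true = ["wait", "wait", "speak", "speak", "speak", "backchannel"]
y_pred = ["wait", "speak", "speak", "speak", "wait", "backchannel"]
print(round(weighted_f1(y_true, y_pred), 3))  # prints 0.667
```

Weighting by support means frequent classes dominate the score, which is worth keeping in mind when comparing systems on imbalanced action distributions.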
Abstract
Speech-to-speech models handle turn-taking naturally but offer limited support for tool-calling or complex reasoning, while production ASR-LLM-TTS voice pipelines offer these capabilities but rely on silence timeouts, which lead to unnatural turn-taking. We present DualTurn, which narrows this gap through generative pretraining on dual-channel conversational audio. The model generates both speakers' future audio autoregressively, implicitly learning conversational dynamics without any labels, and is then fine-tuned to predict interpretable turn-taking signals that map directly to agent actions. DualTurn monitors both channels continuously, anticipating turn boundaries and producing five agent actions. On standard benchmarks, DualTurn (0.5B) outperforms both VAP on agent action prediction (wF1 0.633 vs. 0.389) and a 3.1B audio-text model on word-level turn prediction (AUC 0.930 vs.…
Taxonomy
Topics: Speech and dialogue systems · Phonetics and Phonology Research · Emotion and Mood Recognition
