DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining

Shangeth Rajaa

arXiv:2603.08216·eess.AS·March 10, 2026

TL;DR

DualTurn is a generative speech model pretrained on dual-channel conversational audio, learning turn-taking dynamics without labels. It is then fine-tuned to predict turn-taking signals that map to agent actions, anticipating turn boundaries earlier and with fewer interruptions.

Contribution

It introduces a novel dual-channel generative pretraining approach that captures conversational dynamics and predicts turn-taking signals directly from audio.

Findings

01

Outperforms VAP on agent action prediction (wF1 0.633 vs. 0.389)

02

Achieves a higher word-level turn prediction AUC than a 3.1B audio-text model (0.930 vs. 0.880)

03

Anticipates turn boundaries earlier with fewer interruptions
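
For readers unfamiliar with the wF1 figure quoted above, the following is a minimal sketch of a support-weighted F1 score, assuming wF1 denotes the usual weighted average of per-class F1. The function name `weighted_f1` and the action labels in the usage example are hypothetical placeholders, not the paper's evaluation code or its five agent actions.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 (assumed meaning of wF1)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("empty input")

    support = Counter(y_true)   # number of true examples per class
    predicted = Counter(y_pred)  # number of predictions per class
    total = len(y_true)

    score = 0.0
    for label, n_true in support.items():
        # True positives: predicted this label and it was correct.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = predicted.get(label, 0)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Each class contributes in proportion to its share of the true labels.
        score += (n_true / total) * f1
    return score


# Hypothetical labels, for illustration only.
y_true = ["wait", "wait", "speak", "speak"]
y_pred = ["wait", "speak", "speak", "speak"]
example_score = weighted_f1(y_true, y_pred)
```

Because each class is weighted by its frequency, frequent actions dominate the score; a macro-averaged F1 would weight all classes equally instead.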

Abstract

Speech-to-speech models handle turn-taking naturally but offer limited support for tool-calling or complex reasoning, while production ASR-LLM-TTS voice pipelines offer these capabilities but rely on silence timeouts, which lead to unnatural turn-taking. We present DualTurn, which narrows this gap through generative pretraining on dual-channel conversational audio. The model generates both speakers' future audio autoregressively, implicitly learning conversational dynamics without any labels, and is then fine-tuned to predict interpretable turn-taking signals that map directly to agent actions. DualTurn monitors both channels continuously, anticipating turn boundaries and producing five agent actions. On standard benchmarks, DualTurn (0.5B) outperforms both VAP on agent action prediction (wF1 0.633 vs. 0.389) and a 3.1B audio-text model on word-level turn prediction (AUC 0.930 vs.…
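
To make the silence-timeout baseline mentioned in the abstract concrete, here is a minimal sketch of how such an endpointing rule might work in a cascaded pipeline. The function name, the 20 ms frame size, and the 700 ms timeout are all illustrative assumptions and are not taken from the paper.

```python
def silence_timeout_endpoint(frame_is_speech, frame_ms=20, timeout_ms=700):
    """Return the index of the frame where end-of-turn is declared, or None.

    frame_is_speech: iterable of booleans from a voice activity detector
    (hypothetical input; one flag per audio frame).
    """
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(frame_is_speech):
        if is_speech:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            # Declare end of turn once trailing silence reaches the timeout.
            if silent_run * frame_ms >= timeout_ms:
                return i
    return None


# A user who pauses for 800 ms mid-sentence and then continues:
frames = [True] * 5 + [False] * 40 + [True] * 5
cut_in_at = silence_timeout_endpoint(frames)
```

In this example the rule fires during the pause, before the user resumes at frame 45, so the agent would cut in. A shorter timeout makes such interruptions more likely, while a longer one adds response latency after every genuine turn end; this is the trade-off that anticipatory turn-taking models aim to avoid.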

Taxonomy

Topics: Speech and dialogue systems · Phonetics and Phonology Research · Emotion and Mood Recognition