CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

Xiaosu Su; Zihan Sun; Peilei Jia; Jun Gao

arXiv:2604.08363·cs.SD·April 10, 2026


TL;DR

CapTalk is a unified voice design framework that synthesizes speech with a target timbre and speaking style from natural language descriptions, without reference audio, and extends this from single utterances to expressive, context-aware dialogue.

Contribution

CapTalk introduces a caption-conditioned text-audio autoregressive model that handles both single-utterance and dialogue voice design, incorporating hierarchical variational conditioning and explicit turn-level control.

Findings

01. Achieves state-of-the-art results on a single-utterance voice design benchmark.

02. Improves expression controllability and contextual appropriateness in dialogue.

03. Balances stable timbre preservation with adaptive expression in multi-turn conversations.

Abstract

Voice design from natural language descriptions is emerging as a new task in text-to-speech multimodal generation, aiming to synthesize speech with target timbre and speaking style without relying on reference audio. However, existing methods mainly focus on single-utterance generation, leaving conversational voice design largely unexplored. In this work, we extend voice design to dialogue, enabling better target speaker modeling and turn-level expressive control in natural conversational settings. We propose CapTalk, a unified caption-conditioned text-audio autoregressive framework for both single-utterance and dialogue voice design. CapTalk uses utterance-level captions for single-utterance voice design and speaker-level captions for dialogue speaker modeling, and further introduces a CoT control sequence in dialogue to explicitly plan turn-level dynamic attributes. To resolve the…
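The two conditioning modes described in the abstract can be pictured as two layouts of one autoregressive input sequence. The sketch below is only an illustration under stated assumptions: the marker strings (`<caption>`, `<speaker>`, `<turn>`, `<control>`, `<text>`, `<audio>`), the `Turn` class, and both function names are hypothetical and do not come from the paper, and the abstract does not specify the actual token format or how the CoT control sequence is produced.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Turn:
    """One dialogue turn. All field names are hypothetical."""
    speaker: str
    text: str
    # Hypothetical stand-in for a turn-level control plan, e.g. "emotion=cheerful".
    control: Optional[str] = None


def build_single_utterance_prompt(caption: str, text: str) -> List[str]:
    # Single-utterance mode: one utterance-level caption conditions one utterance.
    return ["<caption>", caption, "<text>", text, "<audio>"]


def build_dialogue_prompt(speaker_captions: Dict[str, str],
                          turns: List[Turn]) -> List[str]:
    # Dialogue mode: speaker-level captions come first and describe each speaker once.
    seq: List[str] = []
    for speaker, caption in speaker_captions.items():
        seq += ["<speaker>", speaker, "<caption>", caption]
    # Each turn may carry a control entry placed before its text and audio slot
    # (assumed ordering; not confirmed by the source).
    for turn in turns:
        seq += ["<turn>", turn.speaker]
        if turn.control is not None:
            seq += ["<control>", turn.control]
        seq += ["<text>", turn.text, "<audio>"]
    return seq


if __name__ == "__main__":
    print(build_single_utterance_prompt("a warm, low-pitched male voice",
                                        "Hello there."))
    print(build_dialogue_prompt(
        {"A": "calm female voice"},
        [Turn("A", "Hi.", control="emotion=cheerful")],
    ))
```

In this layout the speaker caption appears once and stays fixed across turns, while the per-turn control entry can change, which mirrors the split between stable speaker modeling and turn-level dynamic attributes that the abstract describes.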


Code & Models

Repositories

github
