TL;DR
CapTalk is a unified voice design framework for speech synthesis that enables expressive, context-aware dialogue voice generation from natural language descriptions, advancing multimodal text-to-speech capabilities.
Contribution
CapTalk introduces a caption-conditioned autoregressive model for both single-utterance and dialogue voice design, incorporating hierarchical variational conditioning and explicit turn-level control.
Findings
Achieves state-of-the-art results on a single-utterance voice design benchmark.
Improves expression controllability and contextual appropriateness in dialogue.
Balances stable timbre preservation with adaptive expression in multi-turn conversations.
Abstract
Voice design from natural language descriptions is emerging as a new task in text-to-speech multimodal generation, aiming to synthesize speech with target timbre and speaking style without relying on reference audio. However, existing methods mainly focus on single-utterance generation, leaving conversational voice design largely unexplored. In this work, we extend voice design to dialogue, enabling better target speaker modeling and turn-level expressive control in natural conversational settings. We propose CapTalk, a unified caption-conditioned text-audio autoregressive framework for both single-utterance and dialogue voice design. CapTalk uses utterance-level captions for single-utterance voice design and speaker-level captions for dialogue speaker modeling, and further introduces a CoT control sequence in dialogue to explicitly plan turn-level dynamic attributes. To resolve the…
