CAST-TTS: A Simple Cross-Attention Framework for Unified Timbre Control in TTS

Zihao Zheng; Wen Wu; Chao Zhang; Mengyue Wu; and Xuenan Xu

arXiv:2603.16280·cs.SD·March 18, 2026

TL;DR

CAST-TTS introduces a unified cross-attention framework that effectively combines speech and text prompts for timbre control in TTS, simplifying architecture while maintaining high-quality synthesis.

Contribution

It proposes a simple, multi-stage training framework with a shared embedding space and a single cross-attention mechanism for unified timbre control in TTS.

Findings

01. Achieves performance comparable to specialized models

02. Effectively aligns speech and text representations

03. Validates the importance of cross-attention for quality

Abstract

Current Text-to-Speech (TTS) systems typically use separate models for speech-prompted and text-prompted timbre control. While unifying both control signals into a single model is desirable, the challenge of cross-modal alignment often results in overly complex architectures and training objectives. To address this challenge, we propose CAST-TTS, a simple yet effective framework for unified timbre control. Features are extracted from speech prompts and text prompts using pre-trained encoders. The multi-stage training strategy efficiently aligns the speech and projected text representations within a shared embedding space. A single cross-attention mechanism then allows the model to use either of these representations to control the timbre. Extensive experiments validate that the unified cross-attention mechanism is critical for achieving high-quality synthesis. CAST-TTS achieves…
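The conditioning path described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class name `TimbreCrossAttention`, the single-head layout, the residual connection, all dimensions, and the random weights are assumptions made for illustration only. The sketch shows only the idea stated above, namely that text-prompt features are projected into the same space as speech-prompt features, and that one cross-attention layer then attends to whichever prompt is supplied.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class TimbreCrossAttention:
    """Hypothetical single-head cross-attention over a timbre prompt.

    All sizes and weights are illustrative assumptions, not values
    taken from the paper.
    """

    def __init__(self, d_model, d_speech, d_text, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(0.0, 0.02, shape)
        # Projects text-prompt features into the speech embedding space
        # (the "shared embedding space" in the abstract).
        self.text_proj = init(d_text, d_speech)
        # One set of attention weights, shared by both prompt modalities.
        self.w_q = init(d_model, d_model)
        self.w_k = init(d_speech, d_model)
        self.w_v = init(d_speech, d_model)
        self.d_model = d_model

    def embed_prompt(self, features, modality):
        """Map encoder features of either modality into the shared space."""
        if modality == "speech":
            return features
        if modality == "text":
            return features @ self.text_proj
        raise ValueError(f"unknown modality: {modality!r}")

    def __call__(self, hidden, features, modality):
        """hidden: (T, d_model) TTS states; features: (P, d_speech or d_text)."""
        prompt = self.embed_prompt(features, modality)
        q = hidden @ self.w_q
        k = prompt @ self.w_k
        v = prompt @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(self.d_model), axis=-1)  # (T, P)
        # Residual connection is an assumption of this sketch.
        return hidden + attn @ v, attn


# Usage: the same layer handles a speech prompt or a text prompt.
layer = TimbreCrossAttention(d_model=16, d_speech=32, d_text=24)
rng = np.random.default_rng(1)
hidden = rng.normal(size=(5, 16))          # stand-in for TTS hidden states
speech_feats = rng.normal(size=(7, 32))    # stand-in for speech-encoder output
text_feats = rng.normal(size=(3, 24))      # stand-in for text-encoder output
out_speech, attn_speech = layer(hidden, speech_feats, "speech")
out_text, attn_text = layer(hidden, text_feats, "text")
```

Because both prompt types end up with the same feature width before the key and value projections, the attention layer itself needs no modality-specific branches; only the text projection differs between the two paths.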

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music Technology and Sound Studies