AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech
Bin Kang, Shaoguo Wen, Yang Fan, Shunlong Wu, Junjie Wang, Yulin Li, Junzhi Zhao, Junle Wang, and Zhuotao Tian

TL;DR
AgentSteerTTS introduces a multi-agent closed-loop framework that enhances expressive control in text-to-speech systems by disentangling speaker and emotion features, grounding intents with acoustic prototypes, and refining output through feedback.
Contribution
The paper presents a novel multi-agent framework for intent-faithful expressive TTS, combining disentanglement, prototype grounding, and feedback refinement for improved control.
Findings
Significant improvements over baselines on a composite-instruction benchmark.
Effective disentanglement of speaker identity and emotion-prosody.
Enhanced expressiveness and fidelity in synthesized speech.
Abstract
While existing text-to-speech (TTS) models exhibit high expressiveness, fine-grained control over composite instructions remains challenging due to the structural mismatch between discrete textual intents and continuous acoustic realizations. Inspired by human cognitive decoupling, we introduce AgentSteerTTS, a multi-agent closed-loop framework designed for intent-faithful expressive control of composite instructions. First, in our framework, an adversarial disentanglement agent mitigates speaker-emotion leakage by learning separable identity and emotion-prosody subspaces with leakage-suppressing regularization. Next, a Dual-Stream Anchoring Controller grounds abstract intents using a large-scale acoustic prototype library: a Retrieval Agent selects expressive anchors, while a Synthesis Agent fuses them into continuous control vectors via gated attention. Finally, a Fast-Slow Feedback…
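The retrieval-then-fusion step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the cosine-similarity retrieval, the scaled dot-product attention, the scalar sigmoid gate, and every name used here (`retrieve_anchors`, `gated_attention_fuse`, `gate_weights`, the toy 4-dimensional prototype library) are assumptions chosen only to show how a set of retrieved acoustic anchors might be blended into one continuous control vector.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve_anchors(intent, library, k=3):
    """Hypothetical retrieval step: the k prototypes closest to the intent."""
    ranked = sorted(library, key=lambda p: cosine(intent, p), reverse=True)
    return ranked[:k]


def gated_attention_fuse(intent, anchors, gate_weights, gate_bias=0.0):
    """Hypothetical fusion step: attend over anchors, then gate with the intent.

    Returns (control_vector, attention_weights, gate_value).
    """
    dim = len(intent)
    # Scaled dot-product attention: the intent acts as the query,
    # each anchor acts as both key and value.
    scores = [
        sum(q * a for q, a in zip(intent, anchor)) / math.sqrt(dim)
        for anchor in anchors
    ]
    weights = softmax(scores)
    context = [
        sum(w * anchor[i] for w, anchor in zip(weights, anchors))
        for i in range(dim)
    ]
    # Scalar sigmoid gate computed from the concatenation [intent; context].
    z = sum(w * x for w, x in zip(gate_weights, intent + context)) + gate_bias
    gate = 1.0 / (1.0 + math.exp(-z))
    # Convex blend: gate -> 1 trusts the anchors, gate -> 0 trusts the intent.
    control = [gate * c + (1.0 - gate) * q for c, q in zip(context, intent)]
    return control, weights, gate


# Toy data standing in for a prototype library and an intent embedding.
random.seed(0)
DIM = 4
library = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(10)]
intent = [random.gauss(0, 1) for _ in range(DIM)]
gate_weights = [0.1] * (2 * DIM)  # would be learned in a real system

anchors = retrieve_anchors(intent, library, k=3)
control, weights, gate = gated_attention_fuse(intent, anchors, gate_weights)
print("attention weights:", weights)
print("gate:", gate)
print("control vector:", control)
```

The gate lets the model fall back on the raw intent embedding when the retrieved anchors are a poor match, which is one plausible reason to prefer gated fusion over plain attention pooling.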
