MAGIC-TTS: Fine-Grained Controllable Speech Synthesis with Explicit Local Duration and Pause Control
Jialong Mai, Xiaofen Xing, Xiangmin Xu

TL;DR
MAGIC-TTS is a text-to-speech model with explicit token-level timing control. It lets the duration of individual tokens and pauses be set locally while keeping speech quality high.
Contribution
To the authors' knowledge, it is the first TTS model with explicit local timing control over token-level content duration and pauses, which enables fine-grained speech editing.
Findings
Substantially improves token-level duration and pause following over spontaneous synthesis on the authors' timing-control benchmark.
Maintains natural, high-quality synthesis even when no timing control is provided.
Effective in local editing scenarios like navigation and code reading.
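This page does not state how the benchmark scores duration and pause following. As a rough illustration only, one simple way to quantify it is the mean absolute error between requested and realized timings over the tokens that were actually controlled. The function name, the use of `None` for uncontrolled tokens, and the sample numbers below are all assumptions; realized timings would have to come from some external alignment step that is not shown.

```python
from typing import Optional, Sequence


def timing_following_mae(
    requested: Sequence[Optional[float]],
    realized: Sequence[float],
) -> float:
    """Mean absolute error (seconds) over tokens that had a requested value.

    Hypothetical metric for illustration; not taken from the paper.
    """
    if len(requested) != len(realized):
        raise ValueError("requested and realized must have the same length")
    # Skip tokens with no requested timing: they cannot be scored.
    errors = [abs(r - m) for r, m in zip(requested, realized) if r is not None]
    if not errors:
        raise ValueError("no controlled tokens to score")
    return sum(errors) / len(errors)


# Made-up example: the first token is uncontrolled, the other two are.
requested = [None, 0.5, 0.25]
realized = [0.2, 0.75, 0.25]
mae = timing_following_mae(requested, realized)
```

The same function can score pauses by passing requested and realized pause lengths instead of token durations.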
Abstract
Fine-grained local timing control is still absent from modern text-to-speech systems: existing approaches typically provide only utterance-level duration or global speaking-rate control, while precise token-level timing manipulation remains unavailable. To the best of our knowledge, MAGIC-TTS is the first TTS model with explicit local timing control over token-level content duration and pause. MAGIC-TTS is enabled by explicit token-level duration conditioning, carefully prepared high-confidence duration supervision, and training mechanisms that correct zero-value bias and make the model robust to missing local controls. On our timing-control benchmark, MAGIC-TTS substantially improves token-level duration and pause following over spontaneous synthesis. Even when no timing control is provided, MAGIC-TTS maintains natural high-quality synthesis. We further evaluate practical local editing…
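The abstract names explicit token-level duration conditioning and robustness to missing local controls but gives no interface. The sketch below is one plausible way to represent such a request, not the authors' implementation: each token carries an optional duration and an optional pause, and a parallel mask records which values were supplied so that an explicit 0.0 (for example, "no pause") is not confused with an absent control. `TokenTiming`, `build_conditioning`, and the example values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TokenTiming:
    """Hypothetical per-token timing request; None means 'let the model decide'."""
    token: str
    duration_s: Optional[float] = None  # requested spoken duration of the token
    pause_s: Optional[float] = None     # requested pause after the token


def build_conditioning(
    timings: List[TokenTiming],
) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (values, mask), each shaped [num_tokens][2] as (duration, pause).

    mask[i][j] == 1.0 marks a control that was supplied, so a requested
    0.0 stays distinguishable from a missing control.
    """
    values: List[List[float]] = []
    mask: List[List[float]] = []
    for t in timings:
        row_values, row_mask = [], []
        for x in (t.duration_s, t.pause_s):
            if x is None:
                row_values.append(0.0)  # placeholder, flagged as absent by the mask
                row_mask.append(0.0)
            else:
                if x < 0:
                    raise ValueError(f"negative timing for token {t.token!r}")
                row_values.append(float(x))
                row_mask.append(1.0)
        values.append(row_values)
        mask.append(row_mask)
    return values, mask


# Made-up navigation-style request: only "left" has a fixed duration,
# and "now" asks for an explicit zero-length pause.
request = [
    TokenTiming("turn"),
    TokenTiming("left", duration_s=0.45, pause_s=0.30),
    TokenTiming("now", pause_s=0.0),
]
values, mask = build_conditioning(request)
```

Keeping the mask separate from the values is a common design choice for partially specified controls; whether MAGIC-TTS encodes missing controls this way is not stated here.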
