Guided by the Plan: Enhancing Faithful Autoregressive Text-to-Audio Generation with Guided Decoding
Juncheng Wang, Zhe Hu, Chao Xu, Siyue Ren, Yuxiang Feng, Yang Liu, Baigui Sun, Shujun Wang

TL;DR
This paper introduces Plan-Critic, a guiding mechanism for autoregressive text-to-audio models that improves fidelity to complex prompts by leveraging implicit planning in early tokens, leading to state-of-the-art results.
Contribution
The paper reveals that autoregressive audio models encode global semantics early and proposes Plan-Critic, a new guided decoding method that enhances prompt fidelity without extra computational cost.
Findings
Achieved up to a 10-point improvement in CLAP score over the baseline.
Demonstrated implicit planning in early prefix tokens of AR models.
Established new state-of-the-art in AR text-to-audio generation.
Abstract
Autoregressive (AR) models excel at generating temporally coherent audio by producing tokens sequentially, yet they often falter in faithfully following complex textual prompts, especially those describing complex sound events. We uncover a surprising capability in AR audio generators: their early prefix tokens implicitly encode global semantic attributes of the final output, such as event count and sound-object category, revealing a form of implicit planning. Building on this insight, we propose Plan-Critic, a lightweight auxiliary model trained with a Generalized Advantage Estimation (GAE)-inspired objective to predict final instruction-following quality from partial generations. At inference time, Plan-Critic enables guided exploration: it evaluates candidate prefixes early, prunes low-fidelity trajectories, and reallocates computation to high-potential planning seeds. Our…
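The guided exploration described in the abstract (score candidate prefixes early, prune the weak ones, spend the freed budget on the survivors) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_sample`, `toy_critic`, `guided_decode`, and all parameters are hypothetical names, the sampler emits random integers instead of audio tokens, and the critic is a simple stand-in function instead of a trained model. Picking the final output by applying the critic to full sequences is also an assumption of this sketch.

```python
import random
from typing import Callable, List, Sequence, Tuple

Token = int


def toy_sample(rng: random.Random, prefix: Sequence[Token], n_new: int) -> List[Token]:
    """Hypothetical stand-in for an AR token sampler: appends n_new random tokens."""
    return list(prefix) + [rng.randrange(10) for _ in range(n_new)]


def toy_critic(tokens: Sequence[Token]) -> float:
    """Hypothetical stand-in for a learned critic: scores a non-empty sequence in [0, 1]."""
    return sum(tokens) / (9 * len(tokens))


def guided_decode(
    sample: Callable[[random.Random, Sequence[Token], int], List[Token]],
    critic: Callable[[Sequence[Token]], float],
    prefix_len: int = 4,
    total_len: int = 16,
    n_candidates: int = 8,
    keep: int = 2,
    seed: int = 0,
) -> Tuple[List[Token], List[List[Token]]]:
    rng = random.Random(seed)

    # 1. Sample several short candidate prefixes (the "planning seeds").
    prefixes = [sample(rng, [], prefix_len) for _ in range(n_candidates)]

    # 2. Score each prefix early and prune all but the top `keep`.
    kept = sorted(prefixes, key=critic, reverse=True)[:keep]

    # 3. Reallocate the budget: each surviving prefix gets several continuations.
    per_prefix = n_candidates // keep
    completions = [
        sample(rng, p, total_len - prefix_len)
        for p in kept
        for _ in range(per_prefix)
    ]

    # 4. Return the completion the critic rates highest (an assumption of this sketch).
    best = max(completions, key=critic)
    return best, kept
```

Calling `best, kept = guided_decode(toy_sample, toy_critic)` returns one full-length sequence together with the prefixes that survived pruning. In this sketch the number of full continuations equals the number of initial candidates, so pruning redirects the sampling budget toward promising prefixes instead of enlarging it.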
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Music Technology and Sound Studies
