Guided by the Plan: Enhancing Faithful Autoregressive Text-to-Audio Generation with Guided Decoding

Juncheng Wang; Zhe Hu; Chao Xu; Siyue Ren; Yuxiang Feng; Yang Liu; Baigui Sun; Shujun Wang

arXiv:2601.14304·cs.CL·January 22, 2026

TL;DR

This paper introduces Plan-Critic, a guidance mechanism for autoregressive text-to-audio models. It improves fidelity to complex prompts by exploiting the implicit planning encoded in early prefix tokens, and it reaches state-of-the-art results among autoregressive text-to-audio models.

Contribution

The paper reveals that autoregressive audio models encode global semantics early and proposes Plan-Critic, a new guided decoding method that enhances prompt fidelity without extra computational cost.

Findings

01 Achieved up to a 10-point CLAP score improvement over the baseline.

02 Demonstrated implicit planning in the early prefix tokens of AR models.

03 Established a new state of the art in AR text-to-audio generation.

Abstract

Autoregressive (AR) models excel at generating temporally coherent audio by producing tokens sequentially, yet they often falter in faithfully following complex textual prompts, especially those describing complex sound events. We uncover a surprising capability in AR audio generators: their early prefix tokens implicitly encode global semantic attributes of the final output, such as event count and sound-object category, revealing a form of implicit planning. Building on this insight, we propose Plan-Critic, a lightweight auxiliary model trained with a Generalized Advantage Estimation (GAE)-inspired objective to predict final instruction-following quality from partial generations. At inference time, Plan-Critic enables guided exploration: it evaluates candidate prefixes early, prunes low-fidelity trajectories, and reallocates computation to high-potential planning seeds. Our…
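
The guided exploration described in the abstract (score short prefixes early, prune the weak ones, spend the freed budget on the survivors) can be illustrated with a minimal sketch. This is not the authors' implementation: the function `guided_decode`, its parameters, and the toy sampler, continuation, and critic below are all hypothetical stand-ins, and the critic here is a simple hand-written score rather than a trained model.

```python
import random
from typing import Callable, List, Sequence

Token = int


def guided_decode(
    sample_prefix: Callable[[random.Random, int], List[Token]],
    continue_from: Callable[[random.Random, List[Token], int], List[Token]],
    critic: Callable[[Sequence[Token]], float],
    n_candidates: int = 8,
    keep: int = 2,
    prefix_len: int = 4,
    total_len: int = 16,
    seed: int = 0,
) -> List[Token]:
    """Prefix-pruned sampling (illustrative only, all names hypothetical)."""
    rng = random.Random(seed)

    # 1) Sample several short prefixes ("planning seeds").
    prefixes = [sample_prefix(rng, prefix_len) for _ in range(n_candidates)]

    # 2) Score each prefix early and keep only the most promising ones.
    survivors = sorted(prefixes, key=critic, reverse=True)[:keep]

    # 3) Reallocate the budget: each survivor gets several full continuations.
    per_seed = max(1, n_candidates // keep)
    finals = [
        continue_from(rng, list(prefix), total_len)
        for prefix in survivors
        for _ in range(per_seed)
    ]

    # 4) Return the completed sequence the critic rates highest.
    return max(finals, key=critic)


# Toy stand-ins (assumptions, not a real audio model or a trained critic).
def toy_sample_prefix(rng: random.Random, length: int) -> List[Token]:
    return [rng.randrange(10) for _ in range(length)]


def toy_continue(rng: random.Random, prefix: List[Token], total: int) -> List[Token]:
    return prefix + [rng.randrange(10) for _ in range(total - len(prefix))]


def toy_critic(tokens: Sequence[Token]) -> float:
    # Pretend that larger token ids mean better prompt fidelity.
    return sum(tokens) / (9 * len(tokens)) if tokens else 0.0


if __name__ == "__main__":
    print(guided_decode(toy_sample_prefix, toy_continue, toy_critic))
```

In this sketch the number of full-length continuations equals `n_candidates`, so the budget for complete generations matches plain best-of-N sampling with the same N; the only extra work is sampling and scoring the short prefixes. Whether the paper's method uses the same budget accounting is not stated in the text above.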

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Music Technology and Sound Studies