Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation

Kuan-Po Huang; Bo-Ru Lu; Byeonggeun Kim; Mihee Lee; Zalan Fabian; Renard Korzeniowski; Qingming Tang; Greg Ver Steeg; Hung-yi Lee; Chieh-Chi Kao; Chao Wang

arXiv:2605.00329·cs.SD·May 4, 2026

TL;DR

This paper introduces a one-step text-to-audio generation method that cuts inference latency while keeping audio quality close to that of multi-step diffusion models, by combining an energy-distance training objective with representation-level distillation.

Contribution

It proposes a one-step sampling framework in which an energy-scoring head replaces the iterative diffusion sampler, and distillation from a masked autoregressive (MAR) text-to-audio model preserves the text conditioning, giving faster inference at competitive quality.

Findings

01. Outperforms prior one-step methods on the AudioCaps benchmark.

02. Achieves up to 8.5x faster batch inference than AR diffusion systems.

03. Keeps audio quality close to that of multi-step diffusion models.

Abstract

Autoregressive (AR) models with diffusion heads have recently achieved strong text-to-audio performance, yet their iterative decoding and multi-step sampling process introduce high-latency issues. To address this bottleneck, we propose a one-step sampling framework that combines an energy-distance training objective with representation-level distillation. An energy-scoring head maps Gaussian noise directly to audio latents in one step, eliminating the need for a costly recursive diffusion sampling process, while distillation from a masked autoregressive (MAR) text-to-audio model preserves the strong conditioning learned during diffusion training. On the AudioCaps benchmark, our method consistently outperforms prior one-step baselines such as ConsistencyTTA, SoundCTM, AudioLCM and AudioTurbo, on both objective and subjective metrics, while substantially narrowing the quality gap to AR…
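The abstract does not spell out the training objective, so the following is only a minimal sketch of what an energy-distance loss for a one-step head could look like. It uses the standard energy score, E||X - y|| - 0.5 E||X - X'||, estimated from several independent one-step samples per condition. The function names, the linear head, and its weight shapes are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def one_step_head(noise, context, w_noise, w_context):
    """Hypothetical linear head: maps Gaussian noise plus a context vector
    to latents in a single forward pass (no iterative sampling).

    noise:     (m, k) independent noise draws
    context:   (c,)   conditioning vector shared by all m draws
    w_noise:   (k, d) assumed weight matrix
    w_context: (c, d) assumed weight matrix
    returns:   (m, d) generated latents
    """
    return noise @ w_noise + context @ w_context


def energy_score_loss(samples, target, beta=1.0):
    """Monte Carlo estimate of the energy score for one target latent.

    samples: (m, d) independent generator outputs for the same condition
    target:  (d,)   ground-truth latent
    """
    samples = np.asarray(samples, dtype=float)
    target = np.asarray(target, dtype=float)
    m = samples.shape[0]
    if m < 2:
        raise ValueError("need at least two samples for the pairwise term")

    # Fit term: pulls every sample toward the target.
    fit = np.linalg.norm(samples - target, axis=1) ** beta

    # Spread term: rewards diversity among samples, which discourages
    # collapsing to a single point. Diagonal entries are zero, so the
    # sum is divided by the number of off-diagonal pairs.
    diffs = samples[:, None, :] - samples[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=-1) ** beta
    spread = pairwise.sum() / (m * (m - 1))

    return fit.mean() - 0.5 * spread


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k, c, d, m = 4, 3, 2, 8
    latents = one_step_head(
        rng.standard_normal((m, k)),
        rng.standard_normal(c),
        rng.standard_normal((k, d)),
        rng.standard_normal((c, d)),
    )
    print(energy_score_loss(latents, target=np.zeros(d)))
```

In a real system the head would be a neural network trained by backpropagating through this loss. The sketch only shows how the fit and spread terms trade off: samples that are both close to the target and diverse among themselves score lowest.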
