Presto! Distilling Steps and Layers for Accelerating Music Generation
Zachary Novack, Ge Zhu, Jonah Casebeer, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas J. Bryan

TL;DR
Presto! introduces a dual distillation approach that significantly accelerates diffusion-based text-to-music generation, achieving state-of-the-art speed and quality improvements through novel step and layer distillation methods.
Contribution
The paper presents the first GAN-based distillation for TTM and combines step and layer distillation techniques to greatly enhance inference speed and output quality.
Findings
Achieves 10-18x faster music generation
Produces high-quality, diverse music outputs
Outperforms existing state-of-the-art methods in speed and quality
Abstract
Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers via reducing both sampling steps and cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM-family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple, but powerful improvement to a recent layer distillation method that improves learning via better preserving hidden state variance. Finally, we combine our step and layer distillation methods together for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show each yield best-in-class performance. Our combined distillation method can generate…
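The two levers named in the abstract, fewer sampling steps and a lower cost per step, compound multiplicatively. The following is a minimal sketch of that arithmetic, assuming inference cost is simply proportional to steps times active transformer layers. The function names and all numbers are hypothetical illustrations, not the paper's configuration or measurements.

```python
def relative_cost(steps: int, layers_per_step: float) -> float:
    """Inference cost under the simplifying assumption cost = steps * layers.

    This ignores the text encoder, the VAE decoder, and other fixed overheads.
    """
    if steps <= 0 or layers_per_step <= 0:
        raise ValueError("steps and layers_per_step must be positive")
    return steps * layers_per_step


def speedup(base_steps: int, base_layers: float,
            new_steps: int, new_layers: float) -> float:
    """Ratio of baseline cost to accelerated cost."""
    return relative_cost(base_steps, base_layers) / relative_cost(new_steps, new_layers)


# Hypothetical baseline: 40 sampling steps through a 24-layer model.
BASE_STEPS, BASE_LAYERS = 40, 24

# Step distillation alone: fewer steps, same depth (assumed 4 steps).
step_only = speedup(BASE_STEPS, BASE_LAYERS, 4, BASE_LAYERS)

# Layer distillation alone: same steps, fewer active layers (assumed 16 on average).
layer_only = speedup(BASE_STEPS, BASE_LAYERS, BASE_STEPS, 16)

# Both together: the two factors multiply under this cost model.
combined = speedup(BASE_STEPS, BASE_LAYERS, 4, 16)

print(f"step only:  {step_only:.1f}x")
print(f"layer only: {layer_only:.1f}x")
print(f"combined:   {combined:.1f}x")
```

Under this toy model the combined factor equals the product of the two individual factors. Real wall-clock speedups depend on fixed overheads that the model leaves out, so measured gains can differ from this estimate.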
Peer Reviews
Decision: ICLR 2025 Spotlight
1. Very solid target: enhancing efficiency from both the number of steps and the runtime of a single step. 2. Novel to the audio community to combine both step distillation and layer distillation, on top of the successful usage of an adversarial loss. 3. Clear presentation. The origin of the proposed techniques and the differences between Presto! and prior art are well explained. Related works are very up-to-date and hence informative. 4. Great objective measurements. Presto! could outperform or keep o
The ablation study indicates quite high complexity if we want to use Presto! stably and effectively. I am interested in how well the framework would perform when applied to other VAEs or generator backbones, especially if no resource is open.
This paper marks an important step toward a unified approach for faster generative models by bridging adversarial and diffusion methods for text-to-music tasks. In particular, Section 3.1 demonstrates this by applying GAN-based distillation to continuous-time score-based diffusion models. Aligning most distributions with the training distribution, as shown in Table 1, empirically improves all metrics, underscoring the robustness of their approach. The authors also navigate the task of combini
While the paper is well-structured in a general sense, certain sections are hard to follow due to the level of detail. For example, Figure 1 is dense and, despite a detailed caption, remains challenging to interpret. The figure also uses various notations that lack prior explanation, making it difficult for readers to follow the intended process. Additionally, the models are trained on internally licensed data, so the results cannot be easily referenced in future work. With no indication that t
1. The paper is well written. I could understand their motivation, the proposed method, and the experimental results. 2. The authors modified the existing DMD2 and ASE frameworks while conducting ablation studies. As the ablation studies are comprehensive, readers can understand the reasons for the modifications/choices. 3. The authors demonstrated that the proposed distillation method works well in several aspects (FAD, MMD, CLAP scores, and subjective evaluation) despite its fast generation.
I do not think that the paper has a critical weak point, but let me mention some weaknesses. - Although the paper is well-written, some readers would struggle to reproduce the results without the code being public. (I understand some institutes/companies do not make their code publicly available due to their policy.) - Since the proposed training framework is general, I would like to see experimental results on the text-to-audio generation task evaluated on AudioCaps. Furthermore, if we h
Taxonomy
Topics: Music Technology and Sound Studies
Methods: Balanced Selection · Diffusion
