
TL;DR
This paper introduces stage-aware mechanisms for training audio diffusion models, dynamically adjusting guidance and optimization focus based on training progress, leading to improved efficiency and performance.
Contribution
The paper proposes a progress-based regime variable and three stage-aware training strategies that adapt guidance and regularization as training progresses.
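As a rough illustration of what a progress-based regime variable could look like, the sketch below maps training progress to a pair of loss weights that shift emphasis from a semantic term to a refinement term. This is not the paper's formulation: the function names, the sigmoid schedule, and the `midpoint` and `sharpness` parameters are all assumptions made for this example.

```python
import math
from typing import Tuple


def training_progress(step: int, total_steps: int) -> float:
    """Fraction of training completed, clamped to [0, 1] (hypothetical helper)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


def stage_weights(
    progress: float, midpoint: float = 0.5, sharpness: float = 10.0
) -> Tuple[float, float]:
    """Return (semantic_weight, refinement_weight); the two always sum to 1.

    Assumed schedule: a sigmoid in training progress, so early steps favor
    the semantic term and late steps favor the refinement term.
    """
    refinement = 1.0 / (1.0 + math.exp(-sharpness * (progress - midpoint)))
    return 1.0 - refinement, refinement


def combined_loss(
    semantic_loss: float, refinement_loss: float, step: int, total_steps: int
) -> float:
    """Blend two loss terms with stage-dependent weights (illustrative only)."""
    w_sem, w_ref = stage_weights(training_progress(step, total_steps))
    return w_sem * semantic_loss + w_ref * refinement_loss
```

With these assumed defaults, the semantic weight dominates in the first half of training and the refinement weight dominates in the second half; a real schedule would need to be tuned, or replaced by whatever regime variable the paper defines.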
Findings
Stage-aware methods improve convergence in text-conditioned audio generation.
The strategies improve spectral reconstruction metrics.
Stage-dependent guidance and regularization improve training efficiency.
Abstract
Recent progress in diffusion-based audio generation and restoration has substantially improved performance across heterogeneous conditioning regimes, including text-conditioned audio generation and audio-conditioned super-resolution. However, training audio diffusion models remains computationally expensive, and most existing pipelines still rely on static optimization recipes that treat the relative importance of training signals as fixed throughout learning. In this work, we argue that a major source of inefficiency lies in the evolving balance between semantic acquisition and generation-oriented refinement. Early training places stronger emphasis on acquiring condition-aligned semantic structure and coarse global organization, whereas later training increasingly emphasizes temporal consistency, perceptual fidelity, and fine-detail refinement. To characterize this evolving balance, we…
