Inference-time Scaling for Diffusion-based Audio Super-resolution
Yizhu Jin, Zhen Ye, Zeyue Tian, Haohe Liu, Qiuqiang Kong, Yike Guo, Wei Xue

TL;DR
This paper introduces an inference-time scaling method for diffusion-based audio super-resolution that explores multiple solution trajectories to improve output quality without increasing sampling steps, validated across various audio domains.
Contribution
It proposes a novel inference-time scaling paradigm with verifier-guided search algorithms to enhance diffusion model outputs for audio super-resolution.
Findings
Achieved up to 9.70% improvement in aesthetics
Improved speaker similarity by 5.88%
Reduced spectral distance by 46.98% in speech SR
Abstract
Diffusion models have demonstrated remarkable success in generative tasks, including audio super-resolution (SR). In many applications like movie post-production and album mastering, substantial computational budgets are available for achieving superior audio quality. However, while existing diffusion approaches typically increase sampling steps to improve quality, the performance remains fundamentally limited by the stochastic nature of the sampling process, leading to high-variance and quality-limited outputs. Here, rather than simply increasing the number of sampling steps, we propose a different paradigm through inference-time scaling for SR, which explores multiple solution trajectories during the sampling process. Different task-specific verifiers are developed, and two search algorithms, including the random search and zero-order search for SR, are introduced. By actively guiding…
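To make the two search strategies named in the abstract concrete, here is a minimal sketch of verifier-guided search over initial noise. It is an illustration under stated assumptions, not the authors' implementation: `toy_sampler` and `toy_verifier` are hypothetical stand-ins for a diffusion SR sampler and a task-specific verifier, the toy verifier scores against a known target only so the example is self-contained, and the zero-order search shown is a generic local search around a pivot noise, which may differ in detail from the paper's variant.

```python
import numpy as np

def toy_sampler(low_res, noise):
    # Hypothetical stand-in for a diffusion SR sampler: deterministic given the noise.
    return low_res + 0.1 * noise

def toy_verifier(candidate, target):
    # Hypothetical verifier: higher is better (negative mean squared distance).
    return -float(np.mean((candidate - target) ** 2))

def random_search(low_res, score_fn, n_candidates, rng):
    # Best-of-N: draw independent initial noises, keep the highest-scoring output.
    best_out, best_score = None, -np.inf
    for _ in range(n_candidates):
        noise = rng.standard_normal(low_res.shape)
        out = toy_sampler(low_res, noise)
        score = score_fn(out)
        if score > best_score:
            best_out, best_score = out, score
    return best_out, best_score

def zero_order_search(low_res, score_fn, n_iters, n_neighbors, step, rng):
    # Local search: perturb the pivot noise, move to the best neighbor if it improves.
    pivot = rng.standard_normal(low_res.shape)
    best_score = score_fn(toy_sampler(low_res, pivot))
    for _ in range(n_iters):
        neighbors = [pivot + step * rng.standard_normal(low_res.shape)
                     for _ in range(n_neighbors)]
        scored = [(score_fn(toy_sampler(low_res, z)), z) for z in neighbors]
        top_score, top_noise = max(scored, key=lambda pair: pair[0])
        if top_score > best_score:
            pivot, best_score = top_noise, top_score
    return toy_sampler(low_res, pivot), best_score

# Usage with toy data (all values here are illustrative assumptions).
rng = np.random.default_rng(0)
low_res = np.zeros(64)
target = np.full(64, 0.05)
score_fn = lambda x: toy_verifier(x, target)

_, random_score = random_search(low_res, score_fn, n_candidates=16, rng=rng)
_, zero_order_score = zero_order_search(low_res, score_fn, n_iters=5,
                                        n_neighbors=4, step=0.5, rng=rng)
print(random_score, zero_order_score)
```

In both routines the number of sampler steps per candidate stays fixed; the extra compute goes into evaluating more candidates, which is the trade-off the abstract describes.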
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Advanced Image Processing Techniques
