Inference-time Scaling for Diffusion-based Audio Super-resolution

Yizhu Jin; Zhen Ye; Zeyue Tian; Haohe Liu; Qiuqiang Kong; Yike Guo; Wei Xue

arXiv:2508.02391 · cs.SD · August 5, 2025

TL;DR

This paper introduces an inference-time scaling method for diffusion-based audio super-resolution that explores multiple solution trajectories to improve output quality without increasing sampling steps, validated across various audio domains.

Contribution

The paper proposes an inference-time scaling paradigm for diffusion-based audio super-resolution in which task-specific verifiers guide two search algorithms, random search and zero-order search, over the sampling trajectories.
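As a rough illustration of verifier-guided search, the sketch below shows a generic best-of-N random search: draw several independent candidates and keep the one a verifier scores highest. It is not the paper's implementation. `toy_sampler`, `toy_verifier`, and `random_search` are hypothetical names, the sampler merely adds Gaussian noise in place of a real diffusion sampling run, and the verifier is a placeholder score rather than any of the paper's task-specific verifiers.

```python
import random


def toy_sampler(low_res, rng):
    """Hypothetical stand-in for one full diffusion sampling run.

    A real system would run the SR diffusion model from a fresh initial
    noise; here we only perturb the input so the example stays runnable.
    """
    return [x + rng.gauss(0.0, 0.1) for x in low_res]


def toy_verifier(candidate, low_res):
    """Hypothetical verifier (higher is better).

    Placeholder score: negative squared deviation from the input. This is
    an assumption for illustration, not a metric taken from the paper.
    """
    return -sum((c - x) ** 2 for c, x in zip(candidate, low_res))


def random_search(low_res, n_candidates, seed=0):
    """Draw n_candidates independent samples and keep the best-scoring one."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_candidates):
        candidate = toy_sampler(low_res, rng)
        score = toy_verifier(candidate, low_res)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    low_res = [0.0, 0.5, -0.5, 0.25]
    for n in (1, 4, 16):
        _, score = random_search(low_res, n, seed=3)
        print(f"N={n:2d}  best verifier score={score:.5f}")
```

With a fixed seed, the candidates drawn for a small budget are a prefix of those drawn for a larger one, so the best score can only stay the same or improve as N grows. That is the sense in which extra inference compute is spent on more trajectories rather than more sampling steps.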

Findings

01 Achieved up to 9.70% improvement in aesthetics
02 Improved speaker similarity by 5.88%
03 Reduced spectral distance by 46.98% in speech SR

Abstract

Diffusion models have demonstrated remarkable success in generative tasks, including audio super-resolution (SR). In many applications like movie post-production and album mastering, substantial computational budgets are available for achieving superior audio quality. However, while existing diffusion approaches typically increase sampling steps to improve quality, the performance remains fundamentally limited by the stochastic nature of the sampling process, leading to high-variance and quality-limited outputs. Here, rather than simply increasing the number of sampling steps, we propose a different paradigm through inference-time scaling for SR, which explores multiple solution trajectories during the sampling process. Different task-specific verifiers are developed, and two search algorithms, including the random search and zero-order search for SR, are introduced. By actively guiding…
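The abstract names zero-order search without giving its details in the text available here. One common form of zero-order search over initial noise can be sketched as follows, under stated assumptions: start from a random pivot noise, score a few perturbed neighbors with a verifier, move the pivot to the best neighbor if it improves the score, and repeat. All names (`toy_generate`, `toy_verifier`, `zero_order_search`), the step size, and the scoring rule are hypothetical placeholders, and the paper's SR-specific variant may differ.

```python
import random


def toy_generate(low_res, noise):
    """Hypothetical deterministic sampler: maps an initial noise to an output.

    Stands in for running a diffusion sampler from a given starting noise.
    """
    return [x + 0.1 * n for x, n in zip(low_res, noise)]


def toy_verifier(candidate, low_res):
    """Hypothetical verifier (higher is better); placeholder quality score."""
    return -sum((c - x) ** 2 for c, x in zip(candidate, low_res))


def zero_order_search(low_res, n_neighbors=4, n_iters=5, step=0.5, seed=0):
    """Gradient-free local search over the initial noise.

    Returns the output generated from the final pivot and the history of
    best verifier scores (one entry per iteration, plus the starting score).
    """
    rng = random.Random(seed)
    pivot = [rng.gauss(0.0, 1.0) for _ in low_res]
    best_score = toy_verifier(toy_generate(low_res, pivot), low_res)
    history = [best_score]
    for _ in range(n_iters):
        # Sample all neighbors around the current pivot before updating it.
        neighbors = [
            [p + step * rng.gauss(0.0, 1.0) for p in pivot]
            for _ in range(n_neighbors)
        ]
        scored = [
            (toy_verifier(toy_generate(low_res, nb), low_res), nb)
            for nb in neighbors
        ]
        top_score, top_noise = max(scored, key=lambda pair: pair[0])
        # Move the pivot only when a neighbor improves the verifier score.
        if top_score > best_score:
            pivot, best_score = top_noise, top_score
        history.append(best_score)
    return toy_generate(low_res, pivot), history


if __name__ == "__main__":
    output, history = zero_order_search([0.0, 0.5, -0.5, 0.25])
    print([round(s, 5) for s in history])
```

Because the pivot is replaced only on improvement, the recorded score history never decreases. Unlike independent random sampling, the search here refines one starting point locally.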

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Advanced Image Processing Techniques