Fast and Flexible Audio Bandwidth Extension via Vocos
Yatharth Sharma

TL;DR
This paper introduces a neural vocoder-based bandwidth extension model that efficiently enhances audio quality across a wide frequency range, supporting arbitrary upsampling ratios with real-time performance.
Contribution
It presents a novel Vocos-based model that combines neural vocoding with a lightweight refiner for flexible, high-quality audio bandwidth extension at high speeds.
Findings
Achieves competitive log-spectral distance on validation.
Runs far faster than real time, with a real-time factor of 0.0001 on an NVIDIA A100 GPU and 0.0053 on an 8-core CPU.
Supports arbitrary upsampling ratios.
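For readers unfamiliar with the metric behind the first finding, the following is a minimal sketch of one common definition of log-spectral distance (LSD): the root-mean-square difference of log power spectra per frame, averaged over frames. The paper's exact variant (frame size, window, log base) is not stated here, so `frame_len`, `hop`, `eps`, the Hann window, and the function names are illustrative assumptions. A naive DFT is used only to keep the example dependency-free; a real evaluation would use an FFT-based STFT.

```python
import cmath
import math


def power_spectrum(frame):
    """Power spectrum of one frame via a naive DFT (non-negative bins only)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def log_spectral_distance(reference, estimate, frame_len=64, hop=32, eps=1e-10):
    """Mean over frames of the RMS log10 power-spectrum difference.

    frame_len, hop, and eps are assumed values, not taken from the paper.
    """
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    # Periodic Hann window to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / frame_len)
              for i in range(frame_len)]
    distances = []
    for start in range(0, len(reference) - frame_len + 1, hop):
        ref = power_spectrum([w * x for w, x in
                              zip(window, reference[start:start + frame_len])])
        est = power_spectrum([w * x for w, x in
                              zip(window, estimate[start:start + frame_len])])
        squared = [(math.log10(r + eps) - math.log10(e + eps)) ** 2
                   for r, e in zip(ref, est)]
        distances.append(math.sqrt(sum(squared) / len(squared)))
    if not distances:
        raise ValueError("signals are shorter than one frame")
    return sum(distances) / len(distances)


# Toy signals: a full-band reference and an estimate missing the high tone.
full_band = [math.sin(2 * math.pi * 0.05 * n) + 0.5 * math.sin(2 * math.pi * 0.35 * n)
             for n in range(256)]
low_only = [math.sin(2 * math.pi * 0.05 * n) for n in range(256)]
```

Lower values mean the estimated spectrum is closer to the reference; identical signals give a distance of zero, while the band-limited toy signal above gives a positive distance.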
Abstract
We propose a Vocos-based bandwidth extension model that enhances audio at 8-48 kHz by generating missing high-frequency content. Inputs are resampled to 48 kHz and processed by a neural vocoder backbone, enabling a single network to support arbitrary upsampling ratios. A lightweight Linkwitz-Riley-inspired refiner merges the original low band with the generated high frequencies via a smooth crossover. On validation, the model achieves competitive log-spectral distance while running at a real-time factor of 0.0001 on an NVIDIA A100 GPU and 0.0053 on an 8-core CPU, demonstrating practical, high-quality BWE at extreme throughput.
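The two signal-path ideas in the abstract, resampling every input to 48 kHz and merging bands through a smooth crossover, can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the refiner is described only as Linkwitz-Riley-inspired, whereas the code below is a classic fixed fourth-order Linkwitz-Riley crossover (two cascaded second-order Butterworth sections per branch, using the standard Audio EQ Cookbook biquad formulas). Linear interpolation stands in for a proper band-limited resampler, and all function names and the 4 kHz crossover frequency are hypothetical.

```python
import math


def resample_linear(x, fs_in, fs_out=48000):
    """Resample by linear interpolation (stand-in for a band-limited resampler)."""
    n_out = len(x) * fs_out // fs_in
    out = []
    for i in range(n_out):
        pos = i * fs_in / fs_out
        j = min(int(pos), len(x) - 1)
        k = min(j + 1, len(x) - 1)
        out.append(x[j] + (pos - j) * (x[k] - x[j]))
    return out


def butterworth_biquad(kind, fc, fs):
    """Second-order Butterworth (Q = 1/sqrt(2)) low- or high-pass coefficients."""
    w0 = 2.0 * math.pi * fc / fs
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / math.sqrt(2.0)  # sin(w0) / (2 * Q)
    if kind == "low":
        b = [(1 - cos_w0) / 2, 1 - cos_w0, (1 - cos_w0) / 2]
    elif kind == "high":
        b = [(1 + cos_w0) / 2, -(1 + cos_w0), (1 + cos_w0) / 2]
    else:
        raise ValueError("kind must be 'low' or 'high'")
    a0 = 1 + alpha
    return [v / a0 for v in b], [-2 * cos_w0 / a0, (1 - alpha) / a0]


def biquad(x, b, a):
    """Apply one biquad section in direct form I."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


def lr4(x, kind, fc, fs):
    """Fourth-order Linkwitz-Riley branch: the same Butterworth section twice."""
    b, a = butterworth_biquad(kind, fc, fs)
    return biquad(biquad(x, b, a), b, a)


def crossover_merge(original, generated, fc, fs=48000):
    """Keep the original below fc and the generated signal above fc."""
    low = lr4(original, "low", fc, fs)
    high = lr4(generated, "high", fc, fs)
    return [lo + hi for lo, hi in zip(low, high)]


# Example: an 8 kHz input (Nyquist 4 kHz) is brought to 48 kHz, then merged
# with a placeholder "generated" signal at an assumed 4 kHz crossover.
narrowband = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(80)]
upsampled = resample_linear(narrowband, 8000)
merged = crossover_merge(upsampled, [0.0] * len(upsampled), fc=4000)
```

Because the low-pass and high-pass branches of a Linkwitz-Riley crossover sum to an all-pass response, feeding the same signal to both inputs changes only its phase, not its magnitude. That property is what lets the original low band pass through intact while the generated content fills in above the crossover.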
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation
