BridgeVoC: Revitalizing Neural Vocoder from a Restoration Perspective

Andong Li; Tong Lei; Rilin Chen; Kai Li; Meng Yu; Xiaodong Li; Dong Yu; Chengshi Zheng

arXiv:2511.07116·cs.SD·November 11, 2025

Open Access

TL;DR

BridgeVoC introduces a novel diffusion vocoder framework that reinterprets neural vocoding as an audio restoration task, leveraging the Schrödinger bridge and hierarchical subband priors for efficient, high-quality speech synthesis with fewer parameters and steps.

Contribution

The paper proposes a diffusion-based neural vocoder that combines a restoration perspective on vocoding, a novel subband-aware convolutional network, and single-step inference via distillation, achieving state-of-the-art results.

Findings

01. Outperforms existing GAN, DDPM, and flow-matching vocoders.

02. Achieves state-of-the-art quality with only 4 sampling steps.

03. Maintains superior performance with single-step inference.

Abstract

This paper revisits the neural vocoder task through the lens of audio restoration and proposes a novel diffusion vocoder called BridgeVoC. Specifically, through rank analysis, we compare the rank characteristics of the Mel-spectrum with those of other common acoustic degradation factors, and cast the vocoder task as a specialized case of audio restoration, where the range-space spectral (RSS) surrogate of the target spectrum acts as the degraded input. Based on that, we introduce the Schrödinger bridge framework for diffusion modeling, which defines the RSS and the target spectrum as dual endpoints of the stochastic generation trajectory. Further, to fully utilize the hierarchical prior of subbands in the time-frequency (T-F) domain, we devise a novel subband-aware convolutional diffusion network as the data predictor, where subbands are divided following an uneven strategy, and…
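
The rank argument in the abstract can be illustrated with a small numerical sketch. It assumes, since the truncated abstract does not spell this out, that the Mel-spectrum is a linear compression `Y = A X` of the target spectrum and that the RSS surrogate is the pseudo-inverse back-projection `pinv(A) @ Y`. The filterbank helper `mel_filterbank`, the sizes, and the random stand-in spectrum are all hypothetical choices for illustration, not the authors' configuration.

```python
import numpy as np

def mel_filterbank(n_mels: int, n_freq: int, sr: int = 16000) -> np.ndarray:
    """Simplified triangular mel filterbank of shape (n_mels, n_freq).

    Illustrative only; not the filterbank of any particular library or paper.
    """
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_mels + 2 band edges, equally spaced on the mel scale
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    freqs = np.linspace(0.0, sr / 2, n_freq)
    fb = np.zeros((n_mels, n_freq))
    for i in range(n_mels):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

rng = np.random.default_rng(0)
n_freq, n_mels, n_frames = 257, 40, 50   # hypothetical sizes

A = mel_filterbank(n_mels, n_freq)       # wide matrix: rank is at most n_mels
X = rng.random((n_freq, n_frames))       # stand-in for a target magnitude spectrum
Y = A @ X                                # Mel-spectrum as a rank-deficient compression

A_pinv = np.linalg.pinv(A)
X_rss = A_pinv @ Y                       # assumed RSS surrogate: range-space part of X
X_null = X - X_rss                       # null-space residual the Mel-spectrum cannot see

print("rank(A) =", np.linalg.matrix_rank(A), "of", n_freq, "frequency bins")
print("Mel-consistency error:", np.abs(A @ X_rss - Y).max())
```

Under these assumptions, `X_rss` reproduces the Mel-spectrum exactly while missing everything in the null space of `A`, which is what makes it reasonable to treat as a degraded version of the target rather than as an abstract conditioning feature.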

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis