BridgeVoC: Revitalizing Neural Vocoder from a Restoration Perspective
Andong Li, Tong Lei, Rilin Chen, Kai Li, Meng Yu, Xiaodong Li, Dong Yu, Chengshi Zheng

TL;DR
BridgeVoC is a diffusion vocoder that recasts neural vocoding as an audio restoration task, using the Schrödinger bridge and hierarchical subband priors to synthesize high-quality speech efficiently, with fewer parameters and sampling steps.
Contribution
The paper proposes a diffusion-based neural vocoder built on a restoration perspective, a novel subband-aware convolutional network, and single-step inference via distillation, achieving state-of-the-art results.
Findings
Outperforms existing GAN, DDPM, and flow-matching vocoders.
Achieves state-of-the-art quality with only 4 sampling steps.
Maintains superior performance with single-step inference.
Abstract
This paper revisits the neural vocoder task through the lens of audio restoration and proposes a novel diffusion vocoder called BridgeVoC. Specifically, by rank analysis, we compare the rank characteristics of the Mel-spectrum with those of other common acoustic degradation factors, and cast the vocoder task as a specialized case of audio restoration, where the range-space spectral (RSS) surrogate of the target spectrum acts as the degraded input. Based on that, we introduce the Schrödinger bridge framework for diffusion modeling, which defines the RSS and target spectrum as dual endpoints of the stochastic generation trajectory. Further, to fully utilize the hierarchical prior of subbands in the time-frequency (T-F) domain, we devise a novel subband-aware convolutional diffusion network as the data predictor, where subbands are divided following an uneven strategy, and…
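The rank argument and the idea of a range-space surrogate can be illustrated with a small numerical sketch. It rests on assumptions that this summary does not confirm: that the Mel compression is modeled as a linear map `A` applied to a magnitude spectrum, and that the RSS surrogate is obtained with the Moore-Penrose pseudo-inverse of `A`. The filterbank helper `triangular_filterbank`, the linear band spacing, and all sizes are hypothetical stand-ins, not the paper's configuration.

```python
import numpy as np


def triangular_filterbank(n_bands: int, n_bins: int) -> np.ndarray:
    """Toy, linearly spaced triangular filterbank (hypothetical stand-in for a Mel filterbank)."""
    centers = np.linspace(0, n_bins - 1, n_bands + 2)
    bins = np.arange(n_bins)
    fb = np.zeros((n_bands, n_bins))
    for i in range(n_bands):
        left, mid, right = centers[i], centers[i + 1], centers[i + 2]
        rising = (bins - left) / (mid - left)
        falling = (right - bins) / (right - mid)
        # Triangle: minimum of the two slopes, clipped at zero outside the support.
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb


rng = np.random.default_rng(0)
n_bins, n_bands, n_frames = 64, 16, 10  # assumed toy sizes

A = triangular_filterbank(n_bands, n_bins)  # (n_bands, n_bins) compression matrix
X = rng.random((n_bins, n_frames))          # stand-in for the target magnitude spectrum
Y = A @ X                                   # compressed, Mel-like observation

# Rank analysis: the compression has rank at most n_bands < n_bins,
# so part of X (its null-space component) cannot be recovered from Y.
rank_A = np.linalg.matrix_rank(A)

# Assumed RSS surrogate: map the observation back to the full-resolution
# domain with the pseudo-inverse. This keeps only the range-space component of X.
A_pinv = np.linalg.pinv(A)
X_rss = A_pinv @ Y

print("rank of A:", rank_A, "of", n_bins, "frequency bins")
print("consistent with observation:", np.allclose(A @ X_rss, Y))
print("relative error to target:", np.linalg.norm(X - X_rss) / np.linalg.norm(X))
```

Under these assumptions the surrogate reproduces the compressed observation exactly but differs from the target in the null space of `A`. That missing component is what a restoration model would have to generate, which is the sense in which the surrogate acts as the degraded input.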
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
