mdctGAN: Taming transformer-based GAN for speech super-resolution with Modified DCT spectra
Chenhao Shuai, Chaohua Shi, Lu Gan, Hongqing Liu

TL;DR
mdctGAN is a speech super-resolution framework that applies adversarial learning in the modified discrete cosine transform (MDCT) domain to reconstruct high-resolution speech in a phase-aware manner, without vocoders or post-processing. On VCTK it reports high MOS and PESQ scores and state-of-the-art LSD at 48 kHz.
Contribution
The paper introduces mdctGAN, a phase-aware, GAN-based speech super-resolution (SSR) method that combines MDCT spectra with self-attention and reports state-of-the-art results without a vocoder or post-processing.
Findings
High MOS and PESQ scores on the VCTK dataset
State-of-the-art LSD performance at 48 kHz
Effective phase-aware speech reconstruction
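For readers unfamiliar with the LSD metric named above, the following is a minimal sketch of log-spectral distance as it is commonly defined in the SSR literature: the per-frame root-mean-square difference between log power spectra, averaged over frames. The STFT settings (n_fft, hop, Hann window), the eps floor, and the function name are assumptions for illustration; this summary does not state the paper's exact evaluation configuration.

```python
import numpy as np

def log_spectral_distance(ref, est, n_fft=2048, hop=512, eps=1e-10):
    """LSD between two equal-length signals (lower is better).

    n_fft, hop, and eps are illustrative defaults, not values from the paper.
    Both signals must contain at least n_fft samples.
    """
    def power_spec(x):
        x = np.asarray(x, dtype=float)
        window = np.hanning(n_fft)
        n_frames = 1 + (len(x) - n_fft) // hop
        # Slice the signal into overlapping, windowed frames.
        frames = np.stack([x[i * hop : i * hop + n_fft] * window
                           for i in range(n_frames)])
        return np.abs(np.fft.rfft(frames, axis=1)) ** 2

    log_ref = np.log10(power_spec(ref) + eps)
    log_est = np.log10(power_spec(est) + eps)
    # RMS over frequency bins per frame, then mean over frames.
    per_frame = np.sqrt(np.mean((log_ref - log_est) ** 2, axis=1))
    return float(np.mean(per_frame))
```

Identical signals give a distance of zero, and the value grows as the estimated spectrum deviates from the reference, including in the high-frequency band that super-resolution is meant to restore.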
Abstract
Speech super-resolution (SSR) aims to recover high-resolution (HR) speech from its low-resolution (LR) counterpart. Recent SSR methods focus on reconstructing the magnitude spectrogram and ignore the importance of phase reconstruction, which limits recovery quality. To address this issue, we propose mdctGAN, a novel SSR framework based on the modified discrete cosine transform (MDCT). By adversarial learning in the MDCT domain, our method reconstructs HR speech in a phase-aware manner without vocoders or additional post-processing. Furthermore, by learning frequency-consistent features with a self-attentive mechanism, mdctGAN guarantees high-quality speech reconstruction. On the VCTK corpus, the experimental results show that our model produces natural auditory quality with high MOS and PESQ scores. It also achieves the state-of-the-art…
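The phase-aware claim rests on a property of the MDCT: its coefficients are real-valued, and with a window that satisfies the Princen-Bradley condition the transform is exactly invertible by overlap-add. A model that predicts MDCT coefficients therefore needs no separate phase estimate or vocoder to return to a waveform. The sketch below illustrates that property with plain NumPy; it is not the authors' implementation, and the frame size N, the sine window, the zero-padding scheme, and the function names are all assumptions.

```python
import numpy as np

def _mdct_basis(N):
    # Cosine basis of shape (N, 2N): N coefficients per 2N-sample frame.
    n = np.arange(2 * N)
    k = np.arange(N)
    return np.cos(np.pi / N * np.outer(k + 0.5, n + 0.5 + N / 2))

def _sine_window(N):
    # Sine window; satisfies w[n]^2 + w[n+N]^2 = 1 (Princen-Bradley).
    return np.sin(np.pi * (np.arange(2 * N) + 0.5) / (2 * N))

def mdct(x, N):
    """Real-valued MDCT with hop N. Returns an array of shape (n_frames, N)."""
    x = np.asarray(x, dtype=float)
    pad = (-len(x)) % N
    # Pad N zeros on each side so every input sample is covered by two frames.
    x = np.concatenate([np.zeros(N), x, np.zeros(pad + N)])
    n_frames = len(x) // N - 1
    frames = np.stack([x[t * N : t * N + 2 * N] for t in range(n_frames)])
    return (frames * _sine_window(N)) @ _mdct_basis(N).T

def imdct(X, N, length):
    """Inverse MDCT by windowed overlap-add; returns `length` samples."""
    frames = (2.0 / N) * (X @ _mdct_basis(N)) * _sine_window(N)
    out = np.zeros((X.shape[0] + 1) * N)
    for t, frame in enumerate(frames):
        # Time-domain aliasing cancels between neighbouring frames.
        out[t * N : t * N + 2 * N] += frame
    return out[N : N + length]

# Round trip: real coefficients in, original waveform out.
signal = np.random.default_rng(0).standard_normal(1000)
coeffs = mdct(signal, N=64)
restored = imdct(coeffs, N=64, length=len(signal))
print(coeffs.dtype, np.allclose(restored, signal))
```

Unlike an STFT magnitude spectrogram, which discards phase and has to be paired with a phase estimate before inversion, the real-valued coefficient array here carries everything needed to rebuild the signal.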
Taxonomy
Topics: Speech and Audio Processing · Image and Signal Denoising Methods · Seismic Waves and Analysis
Methods: Discrete Cosine Transform
