mdctGAN: Taming transformer-based GAN for speech super-resolution with Modified DCT spectra

Chenhao Shuai; Chaohua Shi; Lu Gan; Hongqing Liu

arXiv:2305.11104·eess.AS·May 22, 2023·1 cites

TL;DR

mdctGAN is a speech super-resolution framework that uses adversarial learning in the MDCT domain to reconstruct high-resolution speech in a phase-aware manner, without vocoders. It reports high MOS and PESQ scores on the VCTK dataset and state-of-the-art LSD at 48 kHz.

Contribution

The paper introduces mdctGAN, a phase-aware, GAN-based speech super-resolution (SSR) method that combines MDCT spectra with self-attention and reports state-of-the-art LSD without vocoders or post-processing.

Findings

01

High MOS and PESQ scores on the VCTK dataset

02

State-of-the-art LSD performance at 48 kHz

03

Effective phase-aware speech reconstruction
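
For readers unfamiliar with the LSD (log-spectral distance) metric named above, here is a minimal NumPy sketch of one common definition: the per-frame root-mean-square difference of log-power spectra, averaged over frames. The FFT size, hop length, Hann window, base-10 logarithm, and the function names `stft_power` and `log_spectral_distance` are all assumptions made for illustration; this page does not state the exact configuration used in the paper.

```python
import numpy as np

def stft_power(x, n_fft=2048, hop=512):
    """Power spectrogram, shape (frames, n_fft // 2 + 1). Parameters are assumed."""
    window = np.hanning(n_fft)
    num_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(num_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def log_spectral_distance(ref, est, eps=1e-10):
    """RMS log-power difference over frequency, averaged over frames."""
    diff = np.log10(stft_power(ref) + eps) - np.log10(stft_power(est) + eps)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))

# Toy check on noise (not real speech): identical signals vs. a scaled copy.
rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
lsd_same = log_spectral_distance(ref, ref)
lsd_scaled = log_spectral_distance(ref, 0.5 * ref)
```

Lower is better: identical signals give 0, and halving the amplitude shifts every log-power bin by log10(4), so `lsd_scaled` comes out at about 0.602 under this definition.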

Abstract

Speech super-resolution (SSR) aims to recover high-resolution (HR) speech from its low-resolution (LR) counterpart. Recent SSR methods focus on reconstructing the magnitude spectrogram and neglect phase reconstruction, which limits recovery quality. To address this issue, we propose mdctGAN, a novel SSR framework based on the modified discrete cosine transform (MDCT). Through adversarial learning in the MDCT domain, our method reconstructs HR speech in a phase-aware manner without vocoders or additional post-processing. Furthermore, by learning frequency-consistent features with a self-attentive mechanism, mdctGAN guarantees high-quality speech reconstruction. On the VCTK corpus, experimental results show that our model produces natural auditory quality with high MOS and PESQ scores. It also achieves the state-of-the-art…
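
To illustrate why working in the MDCT domain avoids a separate phase estimate or vocoder, the sketch below shows that MDCT coefficients are real-valued and that windowed overlap-add of the inverse transform recovers the waveform exactly (time-domain aliasing cancellation). It is a plain NumPy illustration, not code from the paper or its repository: the sine window, the frame size `N = 64`, and the helper names `mdct_basis`, `sine_window`, `mdct`, and `imdct` are assumptions chosen for clarity.

```python
import numpy as np

def mdct_basis(N):
    """Cosine basis of shape (N, 2N): N coefficients per 2N-sample frame."""
    n = np.arange(2 * N)
    k = np.arange(N)
    return np.cos(np.pi / N * np.outer(k + 0.5, n + 0.5 + N / 2))

def sine_window(N):
    """Sine window; satisfies w[n]^2 + w[n+N]^2 = 1 (Princen-Bradley)."""
    return np.sin(np.pi * (np.arange(2 * N) + 0.5) / (2 * N))

def mdct(x, N):
    """Forward MDCT with hop N. len(x) must be a multiple of N."""
    w, C = sine_window(N), mdct_basis(N)
    # Pad one block on each side so every sample is covered by two frames.
    padded = np.concatenate([np.zeros(N), x, np.zeros(N)])
    num_frames = len(padded) // N - 1
    frames = np.stack([padded[t * N : t * N + 2 * N] for t in range(num_frames)])
    return (frames * w) @ C.T  # real-valued, shape (num_frames, N)

def imdct(X, N):
    """Inverse MDCT by windowed overlap-add; aliasing cancels between frames."""
    w, C = sine_window(N), mdct_basis(N)
    frames = (2.0 / N) * (X @ C) * w
    out = np.zeros((X.shape[0] + 1) * N)
    for t, frame in enumerate(frames):
        out[t * N : t * N + 2 * N] += frame
    return out[N:-N]  # drop the padding added in mdct()

# Round trip on a toy signal (random noise standing in for speech).
N = 64
rng = np.random.default_rng(0)
x = rng.standard_normal(16 * N)
X = mdct(x, N)
x_rec = imdct(X, N)
max_error = float(np.max(np.abs(x_rec - x)))
```

Because `X` is a single real array that inverts directly to a waveform, a generator that outputs MDCT coefficients implicitly commits to both magnitude and phase, which is the sense in which such a model can be called phase-aware.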

Code & Models

Repositories

neoncloud/mdctgan (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Image and Signal Denoising Methods · Seismic Waves and Analysis

Methods: Discrete Cosine Transform