A High-Fidelity Speech Super Resolution Network using a Complex Global Attention Module with Spectro-Temporal Loss
Tarikul Islam Tamiti, Biraj Joshi, Rida Hasan, Rashedul Hasan, Taieba Athay, Nursad Mamun, Anomadarshi Barua

TL;DR
This paper introduces CTFT-Net, a speech super-resolution network that reconstructs both magnitude and phase in the complex domain, using a complex global attention module and a multi-resolution loss to improve high-frequency speech reconstruction.
Contribution
The paper presents a complex domain SSR network with a global attention module and multi-resolution loss, advancing phase reconstruction and noise robustness over prior magnitude-focused methods.
Findings
Outperforms state-of-the-art models (NU-Wave, WSRGlow, NVSR, AERO) on the VCTK dataset
Effective for extreme upsampling from 2 kHz to 48 kHz
Reconstructs high frequencies without noisy artifacts
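To make the scale of the 2 kHz to 48 kHz setting concrete, the short sketch below works out the upsampling factor and the frequency band that has to be generated. It relies only on the Nyquist limit (half the sampling rate); the function name `upsampling_summary` is a hypothetical helper and is not taken from the paper.

```python
def upsampling_summary(sr_in_hz: float, sr_out_hz: float) -> dict:
    """Summarize an SSR setting using the Nyquist limit (sampling rate / 2)."""
    nyquist_in = sr_in_hz / 2    # highest frequency present in the input
    nyquist_out = sr_out_hz / 2  # highest frequency representable in the output
    return {
        "factor": sr_out_hz / sr_in_hz,
        "nyquist_in_hz": nyquist_in,
        "nyquist_out_hz": nyquist_out,
        # share of the output bandwidth that is absent from the input
        "missing_fraction": (nyquist_out - nyquist_in) / nyquist_out,
    }


summary = upsampling_summary(2_000, 48_000)
print(summary)
```

In this setting the input carries content only up to 1 kHz, so the band from 1 kHz to 24 kHz has to be synthesized by the model.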
Abstract
Speech super-resolution (SSR) enhances low-resolution speech by increasing the sampling rate. While most SSR methods focus on magnitude reconstruction, recent research highlights the importance of phase reconstruction for improved perceptual quality. Therefore, we introduce CTFT-Net, a Complex Time-Frequency Transformation Network that reconstructs both magnitude and phase in complex domains for improved SSR tasks. It incorporates a complex global attention block to model inter-phoneme and inter-frequency dependencies and a complex conformer to capture long-range and local features, improving frequency reconstruction and noise robustness. CTFT-Net employs time-domain and multi-resolution frequency-domain loss functions for better generalization. Experiments show CTFT-Net outperforms state-of-the-art models (NU-Wave, WSRGlow, NVSR, AERO) on the VCTK dataset, particularly for extreme…
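As a rough illustration of combining a time-domain loss with a multi-resolution frequency-domain loss, the NumPy sketch below sums a waveform L1 term with spectral terms averaged over several STFT sizes. The summary does not state the paper's exact loss terms, weights, or STFT resolutions, so the `(n_fft, hop)` pairs, the complex L1 and log-magnitude terms, and the names `stft` and `multi_resolution_loss` are all assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np


def stft(x: np.ndarray, n_fft: int, hop: int) -> np.ndarray:
    """Complex STFT with a Hann window; shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def multi_resolution_loss(
    pred: np.ndarray,
    target: np.ndarray,
    resolutions=((512, 128), (1024, 256), (2048, 512)),  # assumed values
    eps: float = 1e-7,
) -> float:
    """Time-domain L1 plus spectral terms averaged over STFT resolutions."""
    time_term = np.mean(np.abs(pred - target))
    spec_terms = []
    for n_fft, hop in resolutions:
        p, t = stft(pred, n_fft, hop), stft(target, n_fft, hop)
        # Distance between complex spectra penalizes magnitude and phase errors.
        complex_l1 = np.mean(np.abs(p - t))
        # Log-magnitude distance emphasizes low-energy high-frequency bins.
        log_mag_l1 = np.mean(
            np.abs(np.log(np.abs(p) + eps) - np.log(np.abs(t) + eps))
        )
        spec_terms.append(complex_l1 + log_mag_l1)
    return float(time_term + np.mean(spec_terms))


# Toy signals: a 48 kHz tone and a copy with additive noise.
rng = np.random.default_rng(0)
time_axis = np.arange(4800) / 48_000
clean = np.sin(2 * np.pi * 440.0 * time_axis)
degraded = clean + 0.1 * rng.standard_normal(clean.shape)

loss_same = multi_resolution_loss(clean, clean)
loss_degraded = multi_resolution_loss(degraded, clean)
print(loss_same, loss_degraded)
```

Using several window sizes trades time resolution against frequency resolution, so an error that is hard to see at one resolution can still be penalized at another.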
Taxonomy
Topics: Speech and Audio Processing · Advanced Image Processing Techniques · Advanced Data Compression Techniques
