A High-Fidelity Speech Super Resolution Network using a Complex Global Attention Module with Spectro-Temporal Loss

Tarikul Islam Tamiti; Biraj Joshi; Rida Hasan; Rashedul Hasan; Taieba Athay; Nursad Mamun; Anomadarshi Barua

arXiv:2507.00229 · cs.SD · July 2, 2025

Open Access

TL;DR

This paper introduces CTFT-Net, a speech super-resolution network that reconstructs both magnitude and phase in the complex domain. It uses a complex global attention module and a multi-resolution frequency-domain loss to improve high-frequency speech reconstruction.

Contribution

The paper presents a complex-domain speech super-resolution (SSR) network with a global attention module and a multi-resolution loss, improving phase reconstruction and noise robustness over prior magnitude-focused methods.
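To illustrate why phase matters alongside magnitude, the following minimal sketch (not taken from the paper) compares a sine and a cosine of the same frequency. Their magnitude spectra match, and only the phase tells them apart. The helper names `dft` and `idft`, the 16-sample length, and the test tone are assumptions chosen for illustration; a naive DFT stands in for whatever transform the network actually uses.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform; returns one complex bin per sample."""
    n_samples = len(signal)
    return [
        sum(signal[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
            for n in range(n_samples))
        for k in range(n_samples)
    ]


def idft(spectrum):
    """Naive inverse DFT; returns the real part of the reconstructed signal."""
    n_bins = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / n_bins)
             for k in range(n_bins)) / n_bins).real
        for n in range(n_bins)
    ]


N = 16  # hypothetical frame length, chosen only to keep the example small
sine = [math.sin(2 * math.pi * 2 * n / N) for n in range(N)]
cosine = [math.cos(2 * math.pi * 2 * n / N) for n in range(N)]

spec_sin = dft(sine)
spec_cos = dft(cosine)

# Same magnitude at bin 2, but the phases differ by a quarter cycle.
mag_sin, mag_cos = abs(spec_sin[2]), abs(spec_cos[2])
phase_sin, phase_cos = cmath.phase(spec_sin[2]), cmath.phase(spec_cos[2])

# Rebuilding from magnitude and phase recovers the sine.
full = idft([cmath.rect(abs(z), cmath.phase(z)) for z in spec_sin])

# Rebuilding from magnitude alone (phase set to zero) yields a different waveform.
mag_only = idft([abs(z) for z in spec_sin])
```

In this toy case a magnitude-only reconstruction cannot distinguish the two tones, which is the kind of information a complex-domain model can keep.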

Findings

01. Outperforms state-of-the-art models on the VCTK dataset

02. Effective for extreme upsampling from 2 kHz to 48 kHz

03. Reconstructs high frequencies without noisy artifacts

Abstract

Speech super-resolution (SSR) enhances low-resolution speech by increasing the sampling rate. While most SSR methods focus on magnitude reconstruction, recent research highlights the importance of phase reconstruction for improved perceptual quality. Therefore, we introduce CTFT-Net, a Complex Time-Frequency Transformation Network that reconstructs both magnitude and phase in complex domains for improved SSR tasks. It incorporates a complex global attention block to model inter-phoneme and inter-frequency dependencies and a complex conformer to capture long-range and local features, improving frequency reconstruction and noise robustness. CTFT-Net employs time-domain and multi-resolution frequency-domain loss functions for better generalization. Experiments show CTFT-Net outperforms state-of-the-art models (NU-Wave, WSRGlow, NVSR, AERO) on the VCTK dataset, particularly for extreme…
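The abstract's combination of a time-domain loss with a multi-resolution frequency-domain loss could look roughly like the sketch below. It is an assumption-laden illustration, not the paper's formulation. The L1 distances, the magnitude-only spectral comparison, the Hann window, the three `(frame_len, hop)` resolutions, and the `weight` parameter are all hypothetical choices. It also uses a naive DFT in plain Python rather than a deep-learning framework.

```python
import cmath
import math


def stft_magnitudes(signal, frame_len, hop):
    """Magnitude spectrogram from a Hann-windowed naive DFT (one list per frame)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):  # non-negative frequencies only
            acc = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


def spectral_l1(pred, target, frame_len, hop):
    """Mean absolute difference between magnitude spectrograms at one resolution."""
    p = stft_magnitudes(pred, frame_len, hop)
    t = stft_magnitudes(target, frame_len, hop)
    diffs = [abs(a - b) for fp, ft in zip(p, t) for a, b in zip(fp, ft)]
    return sum(diffs) / len(diffs)


def combined_loss(pred, target,
                  resolutions=((16, 4), (32, 8), (64, 16)), weight=1.0):
    """Time-domain L1 plus the average spectral L1 over several resolutions.

    Signals must be at least as long as the largest frame_len.
    """
    time_term = sum(abs(a - b) for a, b in zip(pred, target)) / len(target)
    freq_term = sum(spectral_l1(pred, target, n, h)
                    for n, h in resolutions) / len(resolutions)
    return time_term + weight * freq_term


# Toy usage: a 128-sample tone and a slightly attenuated estimate of it.
target = [math.sin(2 * math.pi * 5 * n / 128) for n in range(128)]
estimate = [0.8 * s for s in target]
loss_value = combined_loss(estimate, target)
```

Short frames give fine time resolution and long frames give fine frequency resolution, so averaging over several frame lengths penalizes errors that a single resolution might miss.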

Taxonomy

Topics: Speech and Audio Processing · Advanced Image Processing Techniques · Advanced Data Compression Techniques