HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution

Shengkui Zhao; Kun Zhou; Zexu Pan; Yukun Ma; Chong Zhang; Bin Ma

arXiv:2501.10045 · cs.SD · January 20, 2025

TL;DR

HiFi-SR is a unified transformer-convolutional adversarial network for high-fidelity speech super-resolution. Trained end to end, it upscales low-sampling-rate speech to a high sampling rate, extending its frequency range and improving perceived quality.

Contribution

The paper presents a unified transformer-convolutional generator, trained adversarially end to end, for speech super-resolution. It targets the inconsistent representations and degraded quality of prior methods that concatenate independently trained networks.

Findings

01. Outperforms existing SR methods in objective metrics

02. Effective in both in-domain and out-of-domain scenarios

03. Capable of upscaling from 4 kHz to 48 kHz
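
The arithmetic behind the 4 kHz to 48 kHz finding can be checked directly. The sketch below is illustrative only: the helper names (`nyquist_hz`, `upsampling_factor`, `missing_band_hz`) are hypothetical and come from neither the paper nor its repository. It relies only on the standard fact that a signal sampled at rate f can represent frequencies up to f/2.

```python
from fractions import Fraction


def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency representable at a given sampling rate."""
    return sample_rate_hz / 2


def upsampling_factor(src_rate_hz: int, dst_rate_hz: int) -> Fraction:
    """Ratio of output samples to input samples."""
    return Fraction(dst_rate_hz, src_rate_hz)


def missing_band_hz(src_rate_hz: int, dst_rate_hz: int) -> tuple:
    """Frequency band absent from the input that the model has to generate."""
    return (nyquist_hz(src_rate_hz), nyquist_hz(dst_rate_hz))


SRC_RATE_HZ = 4_000   # lowest input rate named in the findings
DST_RATE_HZ = 48_000  # target rate named in the findings

factor = upsampling_factor(SRC_RATE_HZ, DST_RATE_HZ)
low_hz, high_hz = missing_band_hz(SRC_RATE_HZ, DST_RATE_HZ)
print(f"upsampling factor: {factor}")
print(f"band to synthesize: {low_hz:.0f} Hz to {high_hz:.0f} Hz")
```

In this extreme case the input carries content only below 2 kHz, so everything from 2 kHz up to 24 kHz has to be generated rather than interpolated.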

Abstract

The application of generative adversarial networks (GANs) has recently advanced speech super-resolution (SR) based on intermediate representations like mel-spectrograms. However, existing SR methods that typically rely on independently trained and concatenated networks may lead to inconsistent representations and poor speech quality, especially in out-of-domain scenarios. In this work, we propose HiFi-SR, a unified network that leverages end-to-end adversarial training to achieve high-fidelity speech super-resolution. Our model features a unified transformer-convolutional generator designed to seamlessly handle both the prediction of latent representations and their conversion into time-domain waveforms. The transformer network serves as a powerful encoder, converting low-resolution mel-spectrograms into latent space representations, while the convolutional network upscales these…
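
The two-stage generator described in the abstract (a transformer encoder followed by a convolutional upscaling network) implies some simple shape bookkeeping. The sketch below is a rough illustration under stated assumptions, not the authors' implementation. It assumes a HiFi-GAN-style decoder whose upsampling strides multiply out to the mel hop length, and an encoder that emits one latent frame per mel frame. `HOP_LENGTH`, `UPSAMPLE_STRIDES`, and all function names are hypothetical; the abstract as shown here does not specify them.

```python
from math import prod

# Hypothetical values chosen for illustration; the abstract does not state them.
HOP_LENGTH = 256                 # waveform samples per mel frame (assumed)
UPSAMPLE_STRIDES = (8, 8, 2, 2)  # per-stage time upsampling (assumed)


def encoder_output_frames(num_mel_frames: int) -> int:
    """Assume the transformer encoder maps each mel frame to one latent frame."""
    return num_mel_frames


def decoder_output_samples(num_latent_frames: int, strides=UPSAMPLE_STRIDES) -> int:
    """Each upsampling stage multiplies the time axis by its stride."""
    return num_latent_frames * prod(strides)


def generator_output_samples(num_mel_frames: int) -> int:
    """Mel frames -> latent frames -> waveform samples."""
    return decoder_output_samples(encoder_output_frames(num_mel_frames))


# Under these assumptions the stride product must match the hop length,
# otherwise the waveform would not line up with the input frames.
assert prod(UPSAMPLE_STRIDES) == HOP_LENGTH
print(generator_output_samples(100))
```

Under these assumptions, the constraint worth noticing is that the product of the decoder strides has to equal the hop length used to compute the mel-spectrogram.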

Code & Models

Repositories

modelscope/ClearerVoice-Studio
PyTorch · Official

Datasets

alibabasglab/LJSpeech-1.1-48kHz
dataset · 71 dl

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Ultrasonics and Acoustic Wave Propagation