FUSE: Universal Speech Enhancement using Multi-Stage Fusion of Sparse Compression and Token Generation Models for the URGENT 2025 Challenge

Nabarun Goswami; Tatsuya Harada

arXiv:2506.00809 · cs.SD · June 3, 2025

TL;DR

This paper introduces a multi-stage universal speech enhancement system that combines source separation, generative refinement, and fusion techniques to improve speech quality across diverse noisy conditions for the URGENT 2025 Challenge.

Contribution

It presents a novel multi-stage framework integrating sparse compression, generative modeling, and fusion for robust speech enhancement in multilingual, noisy environments.

Findings

01. Effective on challenging multilingual datasets

02. Improves both signal fidelity and perceptual quality

03. Outperforms baseline methods on URGENT Challenge metrics

Abstract

We propose a multi-stage framework for universal speech enhancement, designed for the Interspeech 2025 URGENT Challenge. Our system first employs a Sparse Compression Network to robustly separate sources and extract an initial clean speech estimate from noisy inputs. This is followed by an efficient generative model that refines speech quality by leveraging self-supervised features and optimizing a masked language modeling objective on acoustic tokens derived from a neural audio codec. In the final stage, a fusion network integrates the outputs of the first two stages with the original noisy signal, achieving a balanced improvement in both signal fidelity and perceptual quality. Additionally, a shift trick that aggregates multiple time-shifted predictions, along with output blending, further boosts performance. Experimental results on challenging multilingual datasets with variable…
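The shift trick and output blending mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `shift_ensemble` and `blend`, the shift amounts, the use of a circular shift, and the blending weight `alpha` are all assumptions made for illustration, and `enhance` stands in for any enhancement model that maps a waveform to a waveform of the same length.

```python
import numpy as np

def shift_ensemble(enhance, noisy, shifts=(0, 80, 160)):
    """Average predictions made on time-shifted copies of the input.

    `enhance` is a hypothetical callable (waveform -> waveform, same length).
    The shift values and the circular shift are illustrative assumptions.
    """
    outputs = []
    for s in shifts:
        shifted = np.roll(noisy, s)        # circularly shift the input by s samples
        pred = enhance(shifted)            # run the (hypothetical) model
        outputs.append(np.roll(pred, -s))  # undo the shift so predictions align
    return np.mean(outputs, axis=0)        # aggregate the aligned predictions

def blend(a, b, alpha=0.5):
    """Linearly blend two aligned waveforms; alpha is an assumed weight."""
    return alpha * a + (1.0 - alpha) * b

# Toy usage with a stand-in "model" that simply halves the amplitude.
rng = np.random.default_rng(0)
noisy = rng.standard_normal(16000)
toy_enhance = lambda x: 0.5 * x
ensembled = shift_ensemble(toy_enhance, noisy)
final = blend(ensembled, noisy, alpha=0.8)
```

For a model that is exactly shift-equivariant, as the toy one here is, the ensemble reproduces the single prediction. Averaging only helps when the model's output depends on how the signal aligns with its internal framing, which is typical of frame-based or token-based models.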

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Infant Health and Development