GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling

Jixun Yao; Hexin Liu; Chen Chen; Yuchen Hu; EngSiong Chng; Lei Xie

arXiv:2502.02942 · eess.AS · February 6, 2025

TL;DR

GenSE introduces a hierarchical, language model-based framework for speech enhancement that leverages semantic and acoustic tokens to improve speech quality and robustness in noisy environments.

Contribution

The paper proposes a hierarchical modeling approach that decouples semantic token generation from acoustic token generation, which improves stability and timbre consistency in the enhanced speech.

Findings

1. Outperforms state-of-the-art systems in speech quality
2. Demonstrates strong generalization on benchmark datasets
3. Utilizes semantic tokens for improved intelligibility

Abstract

Semantic information refers to the meaning conveyed through words, phrases, and contextual relationships within a given linguistic structure. Humans can leverage semantic information, such as familiar linguistic patterns and contextual cues, to reconstruct incomplete or masked speech signals in noisy environments. However, existing speech enhancement (SE) approaches often overlook the rich semantic information embedded in speech, which is crucial for improving intelligibility, speaker consistency, and overall quality of enhanced speech signals. To enrich the SE model with semantic information, we employ language models as an efficient semantic learner and propose a comprehensive framework tailored for language model-based speech enhancement, called GenSE. Specifically, we approach SE as a conditional language modeling task rather than a continuous signal regression problem…
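
As a rough illustration of the contrast drawn in the last sentence above, a regression-style SE model is typically trained to minimize a distance between its output and the clean signal, whereas a conditional language model is trained to maximize the likelihood of discrete clean-speech tokens given the noisy input. The notation here is assumed for illustration and is not taken from the paper: x is the noisy input, s the clean signal, y_{1:T} a sequence of discrete clean-speech tokens, f_θ a regression network, and p_θ a token-level language model. The squared-error loss is only one common choice for the regression view.

```latex
% Regression view (assumed notation): predict the clean signal directly.
\mathcal{L}_{\text{reg}}(\theta) = \lVert f_\theta(x) - s \rVert_2^2

% Conditional language modeling view: autoregressive factorization over
% discrete tokens, trained with the negative log-likelihood.
p_\theta(y_{1:T} \mid x) = \prod_{t=1}^{T} p_\theta(y_t \mid y_{<t}, x),
\qquad
\mathcal{L}_{\text{LM}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x)
```

This is the generic form of conditional autoregressive modeling; how GenSE splits generation between semantic and acoustic tokens is not specified in the truncated abstract and is not reconstructed here.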

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and Dialogue Systems