Causal Speech Enhancement with Predicting Semantics based on Quantized Self-supervised Learning Features

Emiru Tsunoo; Yuki Saito; Wataru Nakata; Hiroshi Saruwatari

arXiv:2412.19248·eess.AS·December 30, 2024

TL;DR

This paper introduces a causal speech enhancement model that combines self-supervised learning features with semantic token prediction to improve real-time speech quality, reaching a PESQ of 2.88 on VoiceBank + DEMAND.

Contribution

The work is the first to incorporate SSL features with causality into an SE model, and it adds semantic token prediction through multi-task learning to improve enhancement performance.

Findings

01 Achieved a PESQ of 2.88 on the VoiceBank + DEMAND dataset.

02 Semantic token prediction improves speech enhancement quality.

03 First integration of SSL features with causality in real-time SE.

Abstract

Real-time speech enhancement (SE) is essential to online speech communication. Causal SE models use only the previous context, whereas predicting future information, such as phoneme continuation, may help in performing causal SE. The phonetic information is often represented by quantizing latent features of self-supervised learning (SSL) models. This work is the first to incorporate SSL features with causality into an SE model. The causal SSL features are encoded and combined with spectrogram features using feature-wise linear modulation to estimate a mask for enhancing the noisy input speech. Simultaneously, we quantize the causal SSL features using vector quantization to represent phonetic characteristics as semantic tokens. The model not only encodes SSL features but also predicts the future semantic tokens in multi-task learning (MTL). The experimental results using VoiceBank + DEMAND…
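
The two operations named in the abstract, feature-wise linear modulation (FiLM) and vector quantization, can be illustrated with a minimal single-frame sketch. It is not the authors' implementation: the function names, the toy values, and the sigmoid used to bound the mask are all assumptions made for illustration. In a trained model the FiLM scale and shift would come from a learned encoder of the SSL features and the codebook would be learned, whereas here they are fixed numbers.

```python
import math

def film(features, gamma, beta):
    """Feature-wise linear modulation: per-feature scale (gamma) and shift (beta)."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]

def sigmoid(x):
    """Squash a real value into (0, 1); assumed here as the mask nonlinearity."""
    return 1.0 / (1.0 + math.exp(-x))

def quantize(vector, codebook):
    """Return the index of the nearest codeword by squared Euclidean distance."""
    dists = [sum((v - c) ** 2 for v, c in zip(vector, code)) for code in codebook]
    return dists.index(min(dists))

# One frame of toy data; every value below is hypothetical.
noisy_mag = [0.8, 0.2, 0.5]   # noisy spectrogram magnitudes for 3 frequency bins
gamma = [1.0, -2.0, 0.5]      # FiLM scale, assumed to be derived from encoded SSL features
beta = [0.0, 0.1, -0.3]       # FiLM shift, same assumption
ssl_frame = [0.9, 1.2]        # causal SSL feature vector for this frame
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]  # toy VQ codebook

# Modulate the spectrogram features, then map them to a mask in (0, 1).
mask = [sigmoid(z) for z in film(noisy_mag, gamma, beta)]
# Apply the mask to the noisy input to obtain the enhanced magnitudes.
enhanced = [m * x for m, x in zip(mask, noisy_mag)]
# Quantize the SSL frame into a discrete "semantic token" (a codebook index).
token = quantize(ssl_frame, codebook)
```

In a multi-task setup like the one the abstract describes, a token index such as `token` for a future frame would serve as the target of an auxiliary prediction loss alongside the enhancement loss; that training loop is omitted here.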

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing