Exploiting Consistency-Preserving Loss and Perceptual Contrast Stretching to Boost SSL-based Speech Enhancement

Muhammad Salman Khan; Moreno La Quatra; Kuo-Hsuan Hung; Szu-Wei Fu; Sabato Marco Siniscalchi; Yu Tsao

arXiv:2408.04773 · cs.SD · August 12, 2024


TL;DR

This paper introduces an SSL-based speech enhancement method that combines Transformer-based masking, a consistency-preserving loss, and perceptual contrast stretching, reaching a PESQ score of 3.54 on VoiceBank-DEMAND and surpassing previous SSL-based methods.

Contribution

It proposes an SSL-based speech enhancement framework that integrates Transformer-based (conformer) mask generation, a consistency-preserving loss, and perceptual contrast stretching.
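The masking component estimates an Ideal Ratio Mask (IRM) that scales the noisy spectrogram. As a minimal sketch, assuming the common IRM definition (|S|^2 / (|S|^2 + |N|^2)) ** beta with beta = 0.5 (the exact target form used in the paper may differ), the training target and its application to a noisy STFT could look like this. The function names are hypothetical, and the conformer that predicts the mask from SSL features is not shown.

```python
import numpy as np


def ideal_ratio_mask(clean_mag, noise_mag, beta=0.5, eps=1e-12):
    """Hypothetical helper: IRM = (|S|^2 / (|S|^2 + |N|^2)) ** beta.

    beta = 0.5 is a common choice (assumption, not taken from the paper).
    """
    s2 = np.square(clean_mag)
    n2 = np.square(noise_mag)
    return (s2 / (s2 + n2 + eps)) ** beta


def apply_mask(noisy_spec, mask):
    """Scale a noisy complex STFT by a real-valued mask.

    A real, non-negative mask changes only the magnitude, so the
    noisy phase is reused when the signal is reconstructed.
    """
    return mask * noisy_spec


# Usage: one time-frequency bin with clean magnitude 3 and noise magnitude 4
mask = ideal_ratio_mask(np.array([3.0]), np.array([4.0]))
print(mask)  # approximately sqrt(9 / 25) = 0.6
```

The mask lies in [0, 1]: bins dominated by speech keep most of their energy, and noise-dominated bins are attenuated.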

Findings

1. Achieved a PESQ score of 3.54, surpassing previous SSL-based methods.

2. Demonstrated the effectiveness of a consistency-preserving loss for speech enhancement.

3. Showed that perceptual contrast stretching, which increases the contrast of input and target features according to perceptual importance, improves enhancement quality.
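Perceptual contrast stretching reshapes spectral features per frequency so that perceptually important bands gain contrast. A minimal sketch, assuming PCS is applied as a per-frequency-bin factor in the log1p-compressed magnitude domain and mapped back with expm1: the band gains produced by `hypothetical_band_gains` are placeholders, not the perceptual-importance weights used in the paper, and all names and parameters here are illustrative.

```python
import numpy as np


def hypothetical_band_gains(n_bins=257, sample_rate=16000,
                            band_hz=(1000.0, 4000.0), boost=1.4):
    """Placeholder per-bin factors (NOT the paper's perceptual weights).

    Bins inside band_hz get a factor > 1 (more contrast); others stay at 1.
    """
    freqs = np.linspace(0.0, sample_rate / 2, n_bins)
    gains = np.ones(n_bins)
    in_band = (freqs >= band_hz[0]) & (freqs <= band_hz[1])
    gains[in_band] = boost
    return gains


def pcs(mag, gains):
    """Stretch a (frames, bins) magnitude spectrogram per frequency bin.

    log1p compresses the magnitude, the per-bin factor stretches it,
    and expm1 maps the result back to the linear-magnitude domain.
    """
    return np.expm1(gains[np.newaxis, :] * np.log1p(mag))


# Usage: the same stretch is applied to noisy input and clean target features
rng = np.random.default_rng(0)
noisy_mag = np.abs(rng.standard_normal((10, 257)))
clean_mag = np.abs(rng.standard_normal((10, 257)))
gains = hypothetical_band_gains()
noisy_in, clean_target = pcs(noisy_mag, gains), pcs(clean_mag, gains)
```

With a factor g > 1 a bin maps to (1 + m) ** g - 1, so the ratio between a loud and a quiet value in that bin grows, which is the contrast increase the finding refers to. A factor of 1 leaves the bin unchanged.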

Abstract

Self-supervised representation learning (SSL) has achieved state-of-the-art (SOTA) results on several downstream speech tasks, but SSL-based speech enhancement (SE) solutions still lag behind. To address this issue, we exploit three main ideas: (i) Transformer-based mask generation, (ii) a consistency-preserving loss, and (iii) perceptual contrast stretching (PCS). In detail, conformer layers, which leverage an attention mechanism, are introduced to effectively model frame-level representations and obtain the Ideal Ratio Mask (IRM) for SE. Moreover, we incorporate consistency into the loss function, which processes the input to account for the inconsistency introduced when the signal is reconstructed from the spectrogram. Finally, PCS is employed to improve the contrast of input and target features according to perceptual importance. Evaluated on the VoiceBank-DEMAND task, the proposed solution outperforms previously…


Code & Models

Repositories

salman18376/SE-SSL (PyTorch, official)


Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Phonetics and Phonology Research