Unsupervised Face-Masked Speech Enhancement Using Generative Adversarial Networks With Human-in-the-Loop Assessment Metrics

Syu-Siang Wang; Jia-Yang Chen; Bo-Ren Bai; Shih-Hau Fang; Yu Tsao

arXiv:2407.01939 · eess.AS · July 23, 2024 · IEEE ACM Trans. Audio Speech Lang. Process.

TL;DR

This paper introduces HL-StarGAN, a novel unsupervised face-masked speech enhancement method that incorporates human-in-the-loop assessment metrics, improving speech quality in masked communication scenarios.

Contribution

The paper presents a new face-masked speech enhancement model with a human-in-the-loop metric predictor, trained on a curated database, outperforming existing methods in quality prediction and speech enhancement.

Findings

1. MaskQSS accurately predicts face-masked speech quality.
2. HL-StarGAN outperforms conventional StarGAN and CycleGAN in speech enhancement.
3. The method effectively improves speech quality in face-masked scenarios.

Abstract

Wearing face masks is an essential healthcare measure, particularly during pandemics, yet it can hinder everyday communication. To address this problem, we propose a novel face-masked speech enhancement method, the human-in-the-loop StarGAN (HL-StarGAN). HL-StarGAN comprises a discriminator, a classifier, a metric assessment predictor, and a generator that leverages an attention mechanism. The metric assessment predictor, referred to as MaskQSS, incorporates human participants in its development and serves as a "human-in-the-loop" module during the learning process of HL-StarGAN. The overall HL-StarGAN model was trained using an unsupervised learning strategy that simultaneously focuses on the reconstruction of the original clean speech and the optimization of human perception. To implement HL-StarGAN, we curated a face-masked speech…
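
The abstract describes a generator trained against a discriminator, a classifier, and a metric predictor, with the twin aims of reconstructing the original speech and optimizing perceived quality. As a rough illustration of how such terms could be combined into one generator objective, here is a minimal sketch. It is not the authors' formulation: the function names, the loss weights, the L1 reconstruction term, and the squared-error metric term are all assumptions made for this example, and the predicted quality score stands in for whatever a predictor such as MaskQSS would output.

```python
import numpy as np


def l1_reconstruction(original, reconstructed):
    """Mean absolute error between original features and their reconstruction.

    Assumption: an L1 penalty, as commonly used for cycle-style
    reconstruction in StarGAN/CycleGAN-like models.
    """
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return float(np.mean(np.abs(original - reconstructed)))


def generator_objective(adv_loss, cls_loss, recon_loss, predicted_quality,
                        target_quality=1.0, w_cls=1.0, w_rec=10.0, w_metric=1.0):
    """Weighted sum of hypothetical generator loss terms.

    adv_loss          -- adversarial term from the discriminator
    cls_loss          -- domain-classification term from the classifier
    recon_loss        -- reconstruction term (e.g. l1_reconstruction)
    predicted_quality -- score from a quality predictor, assumed in [0, 1]

    The weights and the squared-error metric term are illustrative
    assumptions, not values taken from the paper.
    """
    # Push the predicted perceptual score toward the best possible score.
    metric_loss = (target_quality - predicted_quality) ** 2
    return adv_loss + w_cls * cls_loss + w_rec * recon_loss + w_metric * metric_loss


# Toy usage with made-up numbers.
recon = l1_reconstruction([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
total = generator_objective(adv_loss=0.5, cls_loss=0.2, recon_loss=0.1,
                            predicted_quality=0.8)
```

In this sketch the metric term is what plays the "human-in-the-loop" role: a predictor fitted to human ratings supplies a differentiable proxy for perceived quality, so the generator can be steered toward it without paired clean and masked recordings.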

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Infant Health and Development

Methods: Softmax · Attention Is All You Need