Audio-Visual Speech Enhancement in Noisy Environments via Emotion-Based Contextual Cues

Tassadaq Hussain; Kia Dashtipour; Yu Tsao; Amir Hussain

arXiv:2402.16394 · eess.AS · February 27, 2024 · 1 citation

Open Access

TL;DR

This paper introduces an emotion-aware audio-visual speech enhancement system that leverages emotional cues from facial features to improve speech clarity and intelligibility in noisy environments, outperforming existing methods.

Contribution

The study presents a novel AVSE approach that incorporates emotional features extracted from facial landmarks, improving enhancement performance in dynamic noise conditions.

Findings

1. Significant improvements in PESQ and STOI scores.
2. Improved results on both subjective and objective speech quality assessments.
3. Better human comprehension of the enhanced speech.

Abstract

In real-world environments, background noise significantly degrades the intelligibility and clarity of human speech. Audio-visual speech enhancement (AVSE) attempts to restore speech quality, but existing methods often fall short, particularly in dynamic noise conditions. This study investigates the inclusion of emotion as a novel contextual cue within AVSE, hypothesizing that incorporating emotional understanding can improve speech enhancement performance. We propose a novel emotion-aware AVSE system that leverages both auditory and visual information. It extracts emotional features from the facial landmarks of the speaker and fuses them with corresponding audio and visual modalities. This enriched data serves as input to a deep UNet-based encoder-decoder network, specifically designed to orchestrate the fusion of multimodal information enhanced with emotion. The network iteratively…
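The abstract says that emotional features are fused with the audio and visual modalities before reaching the UNet-based network, but this page does not say how the fusion is done. As a rough illustration only, the sketch below assumes simple per-frame concatenation (early fusion). The function name `fuse_frames`, the feature sizes, and the toy values are all hypothetical and are not taken from the paper.

```python
from typing import List

Frame = List[float]


def fuse_frames(audio: List[Frame], visual: List[Frame], emotion: List[Frame]) -> List[Frame]:
    """Concatenate per-frame audio, visual, and emotion features.

    Assumption: early fusion by concatenation. The paper may fuse
    the modalities differently (for example, inside the network).
    """
    if not (len(audio) == len(visual) == len(emotion)):
        raise ValueError("all modalities must have the same number of frames")
    # List addition joins the three feature vectors of each time frame.
    return [a + v + e for a, v, e in zip(audio, visual, emotion)]


# Toy input with 2 time frames; feature sizes are arbitrary placeholders.
audio = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # e.g. noisy spectrogram bins
visual = [[1.0, 2.0], [3.0, 4.0]]            # e.g. visual embedding of the face
emotion = [[0.9, 0.1], [0.8, 0.2]]           # e.g. emotion scores from facial landmarks

fused = fuse_frames(audio, visual, emotion)
print(len(fused), len(fused[0]))  # prints "2 7"
```

Each fused frame here has 3 + 2 + 2 = 7 values. In a real system, a sequence of such frames would be the kind of input an encoder-decoder network could consume, provided the three streams are time-aligned first.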

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation