The impact of removing head movements on audio-visual speech enhancement

Zhiqi Kang; Mostafa Sadeghi; Radu Horaud; Xavier Alameda-Pineda; Jacob Donley; Anurag Kumar

arXiv:2202.00538·cs.SD·February 3, 2022

TL;DR

This paper examines how head movements affect audio-visual speech enhancement (AVSE) and shows that applying robust face frontalization (RFF) before a VAE-based AVSE model improves performance.

Contribution

The paper proposes using robust face frontalization (RFF) in combination with a VAE-based AVSE method to mitigate the effect of head movements, making the model more robust to non-frontal, moving faces.

Findings

01 Robust face frontalization (RFF) improves AVSE performance by a considerable margin

02 Head movements degrade AVSE models that are trained on clean, frontal, and steady face images

03 With RFF, scores increase on three standard metrics: STOI, PESQ, and SI-SDR

Abstract

This paper investigates the impact of head movements on audio-visual speech enhancement (AVSE). Although they are a common feature of conversation, head movements have been ignored by past and recent studies: they challenge today's learning-based methods, as they often degrade the performance of models that are trained on clean, frontal, and steady face images. To alleviate this problem, we propose to use robust face frontalization (RFF) in combination with an AVSE method based on a variational auto-encoder (VAE) model. We briefly describe the basic ingredients of the proposed pipeline and perform experiments with a recently released audio-visual dataset. In light of these experiments, and based on three standard metrics, namely STOI, PESQ, and SI-SDR, we conclude that RFF improves the performance of AVSE by a considerable margin.
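Of the three metrics named in the abstract, SI-SDR (scale-invariant signal-to-distortion ratio) is simple enough to sketch directly. The block below is an illustrative implementation of the commonly used definition, not the authors' evaluation code. The function name `si_sdr`, the `eps` stabilizer, and the choice not to remove the signal mean are assumptions made for this example.

```python
import math
from typing import Sequence


def si_sdr(reference: Sequence[float], estimate: Sequence[float],
           eps: float = 1e-12) -> float:
    """Scale-invariant SDR in dB (illustrative sketch, not the paper's code).

    The estimate is projected onto the clean reference; the projection is
    treated as the target and the remainder as distortion.
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")

    # Optimal scaling of the reference that best explains the estimate.
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / (ref_energy + eps)

    # Split the estimate into a scaled-target part and a residual part.
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)

    # eps (an assumption of this sketch) avoids log(0) and division by zero.
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


if __name__ == "__main__":
    clean = [0.0, 1.0, 0.0, -1.0]
    enhanced = [0.1, 0.9, -0.1, -1.1]
    print(f"SI-SDR: {si_sdr(clean, enhanced):.2f} dB")
```

Because the reference is rescaled before the comparison, multiplying the enhanced signal by any nonzero constant leaves the score unchanged, which is why the metric is useful when the output level of an enhancement model is arbitrary. Higher values indicate less residual distortion.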

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Facial Nerve Paralysis Treatment and Research