PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network

Qinghua Liu; Meng Ge; Zhizheng Wu; Haizhou Li

arXiv:2309.06723·cs.SD·September 14, 2023

TL;DR

PIAVE is an audio-visual speaker extraction network that handles head pose variations by generating a pose-invariant frontal view of the talker from each original pose. The abstract reports improved performance on the multi-view MEAD and in-the-wild LRS3 datasets.

Contribution

The paper introduces a method that generates a pose-invariant frontal view of the talker and uses it alongside the original view as a multi-view visual input, making audio-visual speaker extraction more robust to head pose changes.

Findings

01

PIAVE outperforms state-of-the-art methods on the multi-view MEAD and in-the-wild LRS3 datasets.

02

The model is more robust to pose variations.

03

Experimental results show improved speaker extraction accuracy.

Abstract

In everyday spoken communication, we commonly watch a talker's turning head while listening to his/her voice. Humans watch the talker to listen better, and so do machines. However, previous studies on audio-visual speaker extraction have not effectively handled the varying talking face. This paper studies how to take full advantage of the varying talking face. We propose a Pose-Invariant Audio-Visual Speaker Extraction Network (PIAVE) that incorporates an additional pose-invariant view to improve audio-visual speaker extraction. Specifically, we generate the pose-invariant view from each original pose orientation, which enables the model to receive a consistent frontal view of the talker regardless of his/her head pose, thereby forming a multi-view visual input for the speaker. Experiments on the multi-view MEAD and in-the-wild LRS3 datasets demonstrate that PIAVE outperforms the…
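The multi-view input described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `placeholder_frontalizer` and `build_multi_view_input` are hypothetical, and the frontalizer simply copies each frame so that the example runs, whereas a real system would synthesize a frontal face from the posed one.

```python
from typing import Callable, List, Tuple

# One grayscale face crop as an H x W grid of pixel values (assumed layout).
Frame = List[List[float]]


def placeholder_frontalizer(frame: Frame) -> Frame:
    """Hypothetical stand-in for a pose-invariant view generator.

    It only copies the frame; it does not change the head pose.
    """
    return [row[:] for row in frame]


def build_multi_view_input(
    frames: List[Frame],
    frontalize: Callable[[Frame], Frame] = placeholder_frontalizer,
) -> List[Tuple[Frame, Frame]]:
    """Pair every original-pose frame with its generated frontal view."""
    return [(frame, frontalize(frame)) for frame in frames]


# Three dummy 4x4 frames standing in for a short face-crop sequence.
frames = [[[0.0] * 4 for _ in range(4)] for _ in range(3)]
multi_view = build_multi_view_input(frames)
print(len(multi_view), len(multi_view[0]))  # 3 2
```

Each time step carries two views of the talker, the original pose and the generated frontal one, which a downstream visual encoder could consume together.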
