PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network
Qinghua Liu, Meng Ge, Zhizheng Wu, Haizhou Li

TL;DR
PIAVE is an audio-visual speaker extraction network that handles head pose variation by generating a pose-invariant (frontal) view of the talker and using it alongside the original view, improving robustness and performance in multi-view and in-the-wild scenarios.
Contribution
The paper introduces a pose-invariant visual feature generation method for audio-visual speaker extraction, enhancing robustness to head pose changes.
Findings
Experiments show that PIAVE outperforms state-of-the-art methods on the multi-view MEAD and in-the-wild LRS3 datasets.
Adding the pose-invariant view makes the model more robust to head pose variations.
Abstract
In everyday spoken communication, we commonly look at a talker's turning head while listening to his/her voice. Humans see the talker to listen better, and so do machines. However, previous studies on audio-visual speaker extraction have not effectively handled the varying talking face. This paper studies how to take full advantage of the varying talking face. We propose a Pose-Invariant Audio-Visual Speaker Extraction Network (PIAVE) that incorporates an additional pose-invariant view to improve audio-visual speaker extraction. Specifically, we generate the pose-invariant view from each original pose orientation, which enables the model to receive a consistent frontal view of the talker regardless of his/her head pose, thereby forming a multi-view visual input for the speaker. Experiments on the multi-view MEAD and in-the-wild LRS3 datasets demonstrate that PIAVE outperforms the…
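The multi-view visual input described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the `frontalize` callable is a hypothetical stand-in for whatever generator produces the pose-invariant view, per-frame features are plain lists of floats, and fusing the two views by concatenation is an assumption the abstract does not confirm.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # stand-in for a per-frame visual feature vector


def build_multi_view_input(
    original_view: Sequence[Frame],
    frontalize: Callable[[Frame], Frame],
) -> List[Frame]:
    """Pair each original-pose frame with a generated pose-invariant frame.

    `frontalize` is a hypothetical placeholder for the pose-invariant view
    generator; the abstract does not specify how it works. Joining the two
    views by concatenation is likewise an assumption made for illustration.
    """
    multi_view = []
    for frame in original_view:
        frontal = frontalize(frame)  # consistent frontal view of the talker
        multi_view.append(list(frame) + list(frontal))
    return multi_view


# Toy usage: an identity "generator", used only to show the shapes involved.
frames = [[0.1, 0.2], [0.3, 0.4]]
fused = build_multi_view_input(frames, frontalize=lambda f: f)
print(len(fused), len(fused[0]))  # 2 frames, each with 4 features
```

In this sketch the downstream extraction network would receive one fused feature per video frame, so the visual stream keeps its original frame rate while carrying both the original and the frontal view.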
