Learning Audio-Visual embedding for Person Verification in the Wild

Peiwen Sun; Shanshan Zhang; Zishan Liu; Yougen Yuan; Taotao Zhang; Honggang Zhang; Pengfei Hu

arXiv:2209.04093·cs.CV·October 27, 2022

Open Access

TL;DR

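This paper introduces an audio-visual embedding approach for person verification that combines weight-enhanced attentive statistics pooling, joint attentive pooling with cycle consistency, and gated attention fusion. Trained on the VoxCeleb2 dev set, the best system reports 0.18%, 0.27%, and 0.49% EER on the three official VoxCeleb1 trial lists.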

Contribution

It applies weight-enhanced attentive statistics pooling to face verification for the first time and proposes joint attentive pooling with cycle consistency, followed by gated attention fusion, for audio-visual person verification.

Findings

01

Achieved 0.18%, 0.27%, and 0.49% EER on the three official VoxCeleb1 trial lists, the best published results to the authors' knowledge.

02

Introduced cycle consistency in joint attentive pooling to learn implicit inter-frame weights.

03

Demonstrated robustness of audio-visual embeddings.
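
The findings are reported as equal error rate (EER), the operating point where the false acceptance rate equals the false rejection rate. As a rough illustration of how that metric can be computed from verification trials, here is a minimal sketch in plain Python. The function name `equal_error_rate` and the toy scores are assumptions for illustration; this is not the paper's or VoxCeleb's official scoring code, and it picks the threshold where the two rates are closest instead of interpolating.

```python
def equal_error_rate(scores, labels):
    """Approximate EER for similarity scores.

    labels: 1 for a same-person (target) trial, 0 for a different-person trial.
    Hypothetical helper for illustration, not an official scoring tool.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    # Sweep the decision threshold from the highest score downwards;
    # a trial is accepted when its score is >= the threshold.
    trials = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    accepted_targets = 0
    accepted_nontargets = 0

    # Threshold above every score: nothing accepted, so FAR = 0 and FRR = 1.
    best_gap, eer = 1.0, 0.5
    i = 0
    while i < len(trials):
        threshold = trials[i][0]
        # Accept all trials tied at this threshold before measuring the rates.
        while i < len(trials) and trials[i][0] == threshold:
            if trials[i][1] == 1:
                accepted_targets += 1
            else:
                accepted_nontargets += 1
            i += 1
        far = accepted_nontargets / n_nontarget    # false acceptance rate
        frr = 1.0 - accepted_targets / n_target    # false rejection rate
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Toy trials (made-up numbers, not results from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

On real trial lists the scores would typically be cosine similarities between enrollment and test embeddings, and the result is usually quoted as a percentage.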

Abstract

Audio-visual embeddings have already been observed to be more robust than uni-modal embeddings for person verification. Here, we propose a novel audio-visual strategy that considers aggregators from a fusion perspective. First, we introduce weight-enhanced attentive statistics pooling to face verification for the first time. We find that a strong correlation exists between modalities during pooling, so we propose joint attentive pooling, which uses cycle consistency to learn the implicit inter-frame weights. Finally, the modalities are fused with a gated attention mechanism to obtain a robust audio-visual embedding. All the proposed models are trained on the VoxCeleb2 dev dataset, and the best system obtains 0.18%, 0.27%, and 0.49% EER on the three official trial lists of VoxCeleb1, respectively, which are, to our knowledge, the best published results for person verification.
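
The abstract names two building blocks, attentive statistics pooling over frames and gated fusion of the two modalities, without giving their equations. The sketch below shows the generic, widely used form of each in plain Python, under stated assumptions: frame scores come from a dot product with a hypothetical vector `v`, and the gate is a single sigmoid over hypothetical weights `w_a` and `w_v`. It does not reproduce the paper's weight-enhanced variant, its joint pooling, or its cycle-consistency term.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats_pooling(frames, v):
    """Pool T frame-level vectors into one utterance-level vector.

    frames: list of T feature vectors, each of dimension D.
    v: hypothetical scoring vector of dimension D (an assumption; real
       systems usually learn a small network to produce the scores).
    Returns (weighted mean concatenated with weighted std, attention weights).
    """
    scores = [sum(vi * hi for vi, hi in zip(v, h)) for h in frames]
    alpha = softmax(scores)  # one attention weight per frame
    dim = len(frames[0])
    mean = [sum(a * h[d] for a, h in zip(alpha, frames)) for d in range(dim)]
    second = [sum(a * h[d] ** 2 for a, h in zip(alpha, frames)) for d in range(dim)]
    # Clamp at zero to guard against tiny negative values from rounding.
    std = [math.sqrt(max(s - m * m, 0.0)) for s, m in zip(second, mean)]
    return mean + std, alpha


def gated_fusion(audio, visual, w_a, w_v, bias=0.0):
    """Blend two same-sized embeddings with one scalar sigmoid gate.

    w_a, w_v, bias are hypothetical gate parameters (assumptions).
    Returns (fused embedding, gate value in (0, 1)).
    """
    z = sum(w * x for w, x in zip(w_a, audio))
    z += sum(w * x for w, x in zip(w_v, visual)) + bias
    gate = 1.0 / (1.0 + math.exp(-z))
    fused = [gate * a + (1.0 - gate) * x for a, x in zip(audio, visual)]
    return fused, gate


# Toy input: two frames of dimension 2, with a zero scoring vector so the
# attention weights are uniform.
pooled, weights = attentive_stats_pooling([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])
fused, gate = gated_fusion([1.0, 1.0], [3.0, 3.0], [0.0, 0.0], [0.0, 0.0])
```

With a zero scoring vector the pooling reduces to the plain mean and standard deviation over frames, and with zero gate weights the fusion is an equal blend of the two embeddings, which makes both functions easy to sanity-check by hand.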

Taxonomy

Topics: Face recognition and analysis · Video Surveillance and Tracking Methods · Generative Adversarial Networks and Image Synthesis