Learning Audio-Visual embedding for Person Verification in the Wild
Peiwen Sun, Shanshan Zhang, Zishan Liu, Yougen Yuan, Taotao Zhang, Honggang Zhang, Pengfei Hu

TL;DR
This paper introduces a novel audio-visual embedding approach for person verification that leverages advanced pooling and fusion techniques, achieving state-of-the-art accuracy on VoxCeleb benchmarks.
Contribution
It proposes a new weight-enhanced attentive pooling method and a joint attentive pooling with cycle consistency for improved audio-visual fusion in person verification.
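For orientation, the sketch below shows plain attentive statistics pooling, the baseline that the weight-enhanced variant builds on. It is a minimal NumPy illustration and not the paper's method: the enhancement itself is not specified on this page, and the function name, parameter names (W, b, v), and shapes are assumptions chosen for the example.

```python
import numpy as np

def attentive_stats_pooling(frames, W, b, v):
    """Pool frame-level features (T, D) into one utterance-level vector (2*D,).

    W: (D, H), b: (H,), v: (H,) are hypothetical attention parameters.
    Returns the pooled vector and the per-frame attention weights.
    """
    # One scalar attention score per frame.
    scores = np.tanh(frames @ W + b) @ v
    # Numerically stable softmax over the time axis.
    scores = scores - scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()
    # Attention-weighted mean and standard deviation of the frames.
    mean = weights @ frames
    var = weights @ (frames ** 2) - mean ** 2
    std = np.sqrt(np.clip(var, 1e-12, None))
    return np.concatenate([mean, std]), weights
```

With all-zero attention parameters the weights become uniform, and the output reduces to the ordinary mean and standard deviation over frames, which is a convenient sanity check.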
Findings
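Achieved 0.18%, 0.27%, and 0.49% EER on the three official VoxCeleb1 trial lists, which the authors report as the best published results to their knowledge.
Introduced cycle consistency into joint attentive pooling to learn implicit inter-frame weights.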
Demonstrated robustness of audio-visual embeddings.
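The equal error rate (EER) used in these findings is the operating point where the false acceptance rate equals the false rejection rate. The sketch below is one simple way to estimate it from a list of trial scores; the function name and the threshold sweep over observed scores are illustrative choices, not the official VoxCeleb scoring tool.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate EER from verification scores.

    scores: similarity score per trial (higher means more likely same person).
    labels: 1 for target (same person) trials, 0 for non-target trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    # Sweep every observed score as a decision threshold (accept if score >= t).
    thresholds = np.unique(scores)
    far = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels] < t).mean() for t in thresholds])
    # Take the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0
```

On large trial lists a sorted cumulative-sum implementation would be faster; the explicit sweep is kept here because it mirrors the definition directly.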
Abstract
It has already been observed that audio-visual embedding is more robust than uni-modality embedding for person verification. Here, we propose a novel audio-visual strategy that considers aggregators from a fusion perspective. First, we introduce weight-enhanced attentive statistics pooling for the first time in face verification. We find that a strong correlation exists between modalities during pooling, so we propose joint attentive pooling, which uses cycle consistency to learn the implicit inter-frame weights. Finally, the modalities are fused with a gated attention mechanism to obtain a robust audio-visual embedding. All the proposed models are trained on the VoxCeleb2 dev dataset, and the best system obtains 0.18%, 0.27%, and 0.49% EER on the three official trial lists of VoxCeleb1 respectively, which are, to our knowledge, the best published results for person verification.
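The final fusion step can be pictured with a generic gated unit: a learned gate decides, per dimension, how much of the audio and how much of the visual representation enters the fused embedding. The sketch below assumes one common gating form (a sigmoid gate computed from the concatenated inputs); the abstract does not give the paper's exact formulation, and all names and shapes here are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(audio, visual, W_a, W_v, W_g, b_g):
    """Fuse an audio vector (Da,) and a visual vector (Dv,) into one vector (D,).

    W_a: (Da, D), W_v: (Dv, D), W_g: (Da + Dv, D), b_g: (D,) are
    hypothetical learned parameters.
    """
    # Project each modality into the shared embedding space.
    h_a = np.tanh(audio @ W_a)
    h_v = np.tanh(visual @ W_v)
    # Per-dimension gate in (0, 1), computed from both modalities.
    gate = sigmoid(np.concatenate([audio, visual]) @ W_g + b_g)
    # Convex combination: gate near 1 favours audio, near 0 favours visual.
    return gate * h_a + (1.0 - gate) * h_v
```

A gate of this kind lets the model lean on the face stream when the audio is noisy and on the voice stream when the face is occluded, which matches the robustness motivation stated in the abstract.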
Taxonomy
Topics: Face recognition and analysis · Video Surveillance and Tracking Methods · Generative Adversarial Networks and Image Synthesis
