A Multi-View Approach To Audio-Visual Speaker Verification
Leda Sarı, Kritika Singh, Jiatong Zhou, Lorenzo Torresani, Nayan Singhal, Yatharth Saraf

TL;DR
This paper explores audio-visual speaker verification, first with standard fusion techniques that learn joint audio-visual embeddings and then with a novel multi-view model that supports cross-modal verification. The best fusion system reaches a 0.7% audio-visual equal error rate on VoxCeleb1.
Contribution
It introduces a multi-view model for cross-modal verification, enabling audio and visual data to be verified against each other, which was not possible with previous fusion methods.
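As a rough illustration of what cross-modal verification involves, the sketch below projects audio and video features into one shared space, scores an audio/video pair by cosine similarity, and passes both embeddings through a single shared classifier. Everything here is an assumption made for illustration: the dimensions, the random untrained linear "encoders", and the names `audio_encoder`, `video_encoder`, `shared_classifier`, and `cross_modal_score` are hypothetical and are not taken from the paper.

```python
import math
import random

def project(matrix, vector):
    """Apply a linear map (given as a list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def random_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)

# Hypothetical sizes, chosen only to keep the example small.
AUDIO_DIM, VIDEO_DIM, SHARED_DIM, NUM_SPEAKERS = 6, 4, 3, 5

# Hypothetical modality-specific encoders (untrained linear maps).
audio_encoder = random_matrix(SHARED_DIM, AUDIO_DIM, rng)
video_encoder = random_matrix(SHARED_DIM, VIDEO_DIM, rng)

# One classifier shared by both modalities: because both encoders must
# satisfy the same classifier, their outputs live in a common space.
shared_classifier = random_matrix(NUM_SPEAKERS, SHARED_DIM, rng)

def speaker_logits(embedding):
    """Speaker scores from the shared classifier, for either modality."""
    return project(shared_classifier, embedding)

def cross_modal_score(audio_features, video_features):
    """Compare an audio sample against a video sample in the shared space."""
    audio_embedding = project(audio_encoder, audio_features)
    video_embedding = project(video_encoder, video_features)
    return cosine(audio_embedding, video_embedding)

audio_sample = [rng.gauss(0.0, 1.0) for _ in range(AUDIO_DIM)]
video_sample = [rng.gauss(0.0, 1.0) for _ in range(VIDEO_DIM)]
score = cross_modal_score(audio_sample, video_sample)
```

With trained weights, a verification decision would compare `score` against a threshold. With the random weights used here the score carries no meaning beyond showing that the two modalities become directly comparable.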
Findings
Achieved a 0.7% audio-visual (AV) equal error rate (EER) on VoxCeleb1 with fusion techniques.
Achieved 28% EER in cross-modal verification with the multi-view model.
Demonstrated the effectiveness of shared classifiers for cross-modal speaker verification.
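For readers unfamiliar with the metric in the findings above, EER is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below is a minimal threshold sweep over toy scores, not the paper's evaluation code. The function name `equal_error_rate` and the example scores are assumptions for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    target_scores: similarity scores for same-speaker trials.
    nontarget_scores: similarity scores for different-speaker trials.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for threshold in thresholds:
        # False rejection: same-speaker trials scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials at or above the threshold.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy scores (made up): one same-speaker trial scores below one impostor trial.
toy_eer = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
```

On these toy scores the two error rates meet at 25%. Evaluation toolkits typically interpolate along the error curve rather than picking the closest observed threshold, so values on real trial lists may differ slightly.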
Abstract
Although speaker verification has conventionally been an audio-only task, some practical applications provide both audio and visual streams of input. In these cases, the visual stream provides complementary information and can often be leveraged in conjunction with the acoustics of speech to improve verification performance. In this study, we explore audio-visual approaches to speaker verification, starting with standard fusion techniques to learn joint audio-visual (AV) embeddings, and then propose a novel approach to handle cross-modal verification at test time. Specifically, we investigate unimodal and concatenation based AV fusion and report the lowest AV equal error rate (EER) of 0.7% on the VoxCeleb1 dataset using our best system. As these methods lack the ability to do cross-modal verification, we introduce a multi-view model which uses a shared classifier to map audio and video…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
