Audio-visual Speaker Recognition with a Cross-modal Discriminative Network
Ruijie Tao, Rohan Kumar Das, Haizhou Li

TL;DR
This paper introduces VFNet, a cross-modal network that exploits the relation between voice and face to improve speaker recognition, achieving a 16.54% relative EER reduction over a score-level fusion audio-visual baseline on the 2019 NIST SRE evaluation set.
Contribution
The study presents a novel voice-face discriminative network that enhances audio-visual speaker recognition by integrating cross-modal information.
Findings
VFNet provides additional speaker discriminative information.
Achieves a 16.54% relative EER reduction over the score-level fusion audio-visual baseline.
Demonstrates this gain on the evaluation set of the 2019 NIST SRE.
Abstract
Audio-visual speaker recognition is one of the tasks in the recent 2019 NIST speaker recognition evaluation (SRE). Studies in neuroscience and computer science all point to the fact that visual and auditory neural signals interact in the cognitive process. This motivated us to study a cross-modal network, namely the voice-face discriminative network (VFNet), that establishes the general relation between human voice and face. Experiments show that VFNet provides additional speaker discriminative information. With VFNet, we achieve a 16.54% relative reduction in equal error rate over the score-level fusion audio-visual baseline on the evaluation set of the 2019 NIST SRE.
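To make the metrics in the abstract concrete, the following is a minimal sketch of score-level fusion, equal error rate (EER), and relative EER reduction. It is not the authors' implementation: the function names, the equal fusion weights, and the toy scores are all assumptions chosen for illustration, and the EER here is a simple threshold sweep rather than an interpolated estimate.

```python
def fuse_scores(score_streams, weights):
    """Weighted sum of per-trial scores from several systems (score-level fusion)."""
    return [sum(w * s for w, s in zip(weights, trial)) for trial in zip(*score_streams)]


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: sweep thresholds and take the point where the
    false acceptance rate (FAR) and false rejection rate (FRR) are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


def eer_from_labels(scores, labels):
    """Split scores into target / non-target trials by label, then compute EER."""
    targets = [s for s, y in zip(scores, labels) if y]
    nontargets = [s for s, y in zip(scores, labels) if not y]
    return equal_error_rate(targets, nontargets)


# Hypothetical toy trials (1 = same speaker, 0 = different speaker).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
audio = [0.9, 0.8, 0.3, 0.7, 0.2, 0.6, 0.1, 0.4]
face = [0.8, 0.4, 0.7, 0.9, 0.3, 0.2, 0.5, 0.1]

# Equal weights are an assumption, not a value from the paper.
fused = fuse_scores([audio, face], [0.5, 0.5])

audio_eer = eer_from_labels(audio, labels)
face_eer = eer_from_labels(face, labels)
fused_eer = eer_from_labels(fused, labels)
```

In this toy setup each single-modality system makes errors on different trials, so the fused scores separate the two classes better than either stream alone. A further score stream, such as the output of a voice-face matching network, could be passed to fuse_scores as a third list with its own weight; relative_reduction then compares the EER before and after adding it.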
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
