Audio-visual Speaker Recognition with a Cross-modal Discriminative Network

Ruijie Tao; Rohan Kumar Das; Haizhou Li

arXiv:2008.03894·eess.AS·August 11, 2020·Interspeech·1 cites

TL;DR

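This paper introduces VFNet, a cross-modal voice-face discriminative network that models the relation between a speaker's voice and face, and reports a 16.54% relative equal error rate (EER) reduction over a score-level fusion audio-visual baseline on the 2019 NIST SRE evaluation set.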

Contribution

The study presents a novel voice-face discriminative network that enhances audio-visual speaker recognition by integrating cross-modal information.

Findings

01

VFNet provides additional speaker discriminative information.

02

Achieves a 16.54% relative EER reduction over the score-level fusion audio-visual baseline.

03

Demonstrates effectiveness on the evaluation set of the 2019 NIST SRE.

Abstract

Audio-visual speaker recognition is one of the tasks in the 2019 NIST speaker recognition evaluation (SRE). Studies in neuroscience and computer science both indicate that visual and auditory neural signals interact in the cognitive process. This motivated us to study a cross-modal network, the voice-face discriminative network (VFNet), which establishes the general relation between human voice and face. Experiments show that VFNet provides additional speaker-discriminative information. With VFNet, we achieve a 16.54% relative reduction in equal error rate over the score-level fusion audio-visual baseline on the evaluation set of the 2019 NIST SRE.
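
The metric quoted in the abstract can be made concrete with a small sketch. The code below is not the authors' implementation: it assumes a simple threshold sweep to approximate the EER, an equal-weight sum as a stand-in for score-level fusion (the actual fusion weights are not given here), and hypothetical function names (`equal_error_rate`, `fuse_scores`, `relative_reduction`). It only illustrates how a relative EER reduction is derived from two EER values.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Returns the average of the false acceptance and false rejection rates
    at the threshold where the two are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: non-target trials scored at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def fuse_scores(audio_scores, visual_scores, weight=0.5):
    """Score-level fusion as a weighted sum (the weight is an assumption)."""
    return [weight * a + (1 - weight) * v
            for a, v in zip(audio_scores, visual_scores)]


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


if __name__ == "__main__":
    # Toy scores for illustration only; these are not results from the paper.
    audio_target, audio_nontarget = [0.9, 0.8, 0.4], [0.7, 0.3, 0.1]
    visual_target, visual_nontarget = [0.6, 0.7, 0.9], [0.2, 0.4, 0.3]

    audio_eer = equal_error_rate(audio_target, audio_nontarget)
    fused_eer = equal_error_rate(
        fuse_scores(audio_target, visual_target),
        fuse_scores(audio_nontarget, visual_nontarget),
    )
    print("audio-only EER:", audio_eer)
    print("fused EER:", fused_eer)
    print("relative reduction (%):", relative_reduction(audio_eer, fused_eer))
```

Under this definition, a 16.54% relative reduction means the new system's EER is about 0.8346 times the baseline's EER; it is not a drop of 16.54 absolute percentage points.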

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis