HLT-NUS Submission for NIST 2019 Multimedia Speaker Recognition Evaluation

Rohan Kumar Das; Ruijie Tao; Jichen Yang; Wei Rao; Cheng Yu; Haizhou Li

arXiv:2010.03905·eess.AS·October 9, 2020·1 cites

TL;DR

This paper presents the HLT-NUS audio-visual speaker verification system for the 2019 NIST Multimedia SRE, which fuses x-vector audio systems with ResNet and InsightFace face recognition systems at the score level.

Contribution

The work introduces a multimodal speaker verification approach with separate audio and visual systems and score-level fusion, tailored for the NIST 2019 multimedia challenge.

Findings

01

Achieved EER of 0.88% on the evaluation set

02

Achieved actDCF of 0.026 on the evaluation set

03

Demonstrated effectiveness of multimodal fusion in speaker recognition
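
The two metrics reported above can be illustrated with a minimal sketch. It assumes a list of target-trial scores and a list of non-target-trial scores, where a higher score means "same speaker". The function names and the default cost parameters (`p_target`, `c_miss`, `c_fa`) are illustrative assumptions rather than values stated on this page, and the threshold sweep is a simple approximation rather than the official NIST scoring procedure.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Return (miss rate, false-alarm rate) when accepting scores >= threshold."""
    p_miss = sum(s < threshold for s in target_scores) / len(target_scores)
    p_fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return p_miss, p_fa


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold."""
    best = None
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        p_miss, p_fa = error_rates(target_scores, nontarget_scores, t)
        gap = abs(p_miss - p_fa)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (p_miss + p_fa) / 2)
    return best[1]


def normalized_dcf(p_miss, p_fa, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Detection cost, normalized by the cost of the best trivial system.

    The default parameters are assumed placeholders for illustration.
    """
    cost = c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
    return cost / min(c_miss * p_target, c_fa * (1 - p_target))
```

The EER is the operating point where the miss rate equals the false-alarm rate. An actual DCF applies `normalized_dcf` to the error rates measured at a fixed, pre-agreed decision threshold, so unlike the EER it also penalizes poorly calibrated scores.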

Abstract

This work describes the speaker verification system developed by the Human Language Technology Laboratory, National University of Singapore (HLT-NUS) for the 2019 NIST Multimedia Speaker Recognition Evaluation (SRE). Multimedia research has gained attention across a wide range of applications, and speaker recognition is no exception. In contrast to previous NIST SREs, the latest edition focuses on a multimedia track to recognize speakers using both audio and visual information. We developed separate systems for audio and visual inputs, followed by score-level fusion of the systems from the two modalities to use their information jointly. The audio systems are based on x-vector speaker embeddings, whereas the face recognition systems are based on ResNet and InsightFace face embeddings. With post evaluation studies and refinements, we obtain an equal error rate (EER) of…
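
Score-level fusion, as described in the abstract, can be sketched as follows. This is only one plausible form: the abstract does not say how the scores are normalized or weighted, so the z-normalization step, the `audio_weight` parameter, and the function names are assumptions made for illustration.

```python
from statistics import mean, pstdev


def z_normalize(scores):
    """Shift and scale scores to zero mean and unit variance."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores identical: no ranking information to preserve.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def fuse_scores(audio_scores, visual_scores, audio_weight=0.5):
    """Weighted sum of per-trial audio and visual scores (hypothetical scheme).

    Both lists must hold one score per trial, in the same trial order.
    """
    if len(audio_scores) != len(visual_scores):
        raise ValueError("audio and visual score lists must align trial by trial")
    audio = z_normalize(audio_scores)
    visual = z_normalize(visual_scores)
    return [
        audio_weight * a + (1.0 - audio_weight) * v
        for a, v in zip(audio, visual)
    ]
```

Normalizing each modality first puts the two score ranges on a comparable scale, so that neither system dominates the sum simply because its raw scores are larger.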

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing