Speaker Recognition in Realistic Scenario Using Multimodal Data
Saqlain Hussain Shah, Muhammad Saad Saeed, Shah Nawaz, Muhammad Haroon Yousaf

TL;DR
This paper introduces a multimodal system combining face and voice data using a two-branch neural network to enhance speaker recognition accuracy on large-scale datasets, demonstrating the benefit of facial information.
Contribution
The paper presents a novel two-branch neural network architecture that jointly learns face and voice representations for improved speaker recognition.
Findings
Facial information improves speaker recognition performance.
Overlap exists between face and voice features.
The framework performs well on the VoxCeleb1 dataset.
Abstract
In recent years, an association has been established between the faces and voices of celebrities by leveraging large-scale audio-visual information from YouTube. The availability of large-scale audio-visual datasets has been instrumental in developing speaker recognition methods based on standard Convolutional Neural Networks. Thus, the aim of this paper is to leverage large-scale audio-visual information to improve the speaker recognition task. To achieve this, we propose a two-branch network to learn joint representations of faces and voices in a multimodal system. Afterwards, features are extracted from the two-branch network to train a classifier for speaker recognition. We evaluate the proposed framework on a large-scale audio-visual dataset named VoxCeleb. Our results show that the addition of facial information improves speaker recognition performance. Moreover, our results indicate that…
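The two-branch idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the single linear layer per branch, the ReLU and L2 normalisation, the fusion by concatenation, the toy dimensions, and all function names (`make_branch`, `embed`, `joint_features`) are assumptions made for illustration. The weights are random and untrained. The sketch only shows how a face vector and a voice vector could each pass through their own branch and be combined into one feature vector for a downstream speaker classifier.

```python
import math
import random


def linear(x, weights, bias):
    """Apply y = W x + b to a single input vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def l2_normalize(x, eps=1e-12):
    """Scale x to (at most) unit Euclidean length; eps avoids division by zero."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / (norm + eps) for v in x]


def make_branch(in_dim, out_dim, rng):
    """Create one randomly initialised, untrained branch (hypothetical layout)."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def embed(x, branch):
    """Map one modality into the shared space: linear -> ReLU -> L2 norm."""
    weights, bias = branch
    hidden = [max(0.0, v) for v in linear(x, weights, bias)]
    return l2_normalize(hidden)


def joint_features(face_vec, voice_vec, face_branch, voice_branch):
    """Concatenate the two branch embeddings (assumed fusion strategy)."""
    return embed(face_vec, face_branch) + embed(voice_vec, voice_branch)


# Toy sizes chosen for readability, not taken from the paper.
FACE_DIM, VOICE_DIM, EMBED_DIM = 8, 6, 4

rng = random.Random(0)
face_branch = make_branch(FACE_DIM, EMBED_DIM, rng)
voice_branch = make_branch(VOICE_DIM, EMBED_DIM, rng)

# Stand-ins for pre-extracted face and voice descriptors.
face_vec = [rng.random() for _ in range(FACE_DIM)]
voice_vec = [rng.random() for _ in range(VOICE_DIM)]

features = joint_features(face_vec, voice_vec, face_branch, voice_branch)
print(len(features))  # length of the fused feature vector
```

In a real system the branches would be trained so that faces and voices of the same identity land close together, and the fused features would then be used to train the speaker classifier mentioned in the abstract.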
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
