DeepMSRF: A novel Deep Multimodal Speaker Recognition framework with Feature selection
Ehsan Asali, Farzan Shenavarmasouleh, Farid Ghareh Mohammadi, Prasanth Sengadu Suresh, and Hamid R. Arabnia

TL;DR
DeepMSRF is a deep learning framework that fuses audio and facial features using a two-stream VGGNET to improve speaker recognition accuracy in videos, outperforming single-modality methods.
Contribution
The paper introduces DeepMSRF, a new multimodal fusion framework with feature selection for speaker recognition combining audio and face data.
Findings
DeepMSRF outperforms single-modality methods by at least 3% in accuracy.
Effective multimodal fusion enhances speaker recognition performance.
The framework successfully recognizes gender and identity in video streams.
Abstract
For recognizing speakers in video streams, significant research has been devoted to obtaining a rich machine learning model by extracting high-level speaker features such as facial expression, emotion, and gender. However, generating such a model is not feasible using only single-modality feature extractors that exploit either audio signals or image frames extracted from video streams. In this paper, we address this problem from a different perspective and propose an unprecedented multimodal data fusion framework called DeepMSRF (Deep Multimodal Speaker Recognition with Feature selection). We run DeepMSRF by feeding it features from two modalities, namely speakers' audio and face images. DeepMSRF uses a two-stream VGGNET to train on both modalities to reach a comprehensive model capable of accurately recognizing the speaker's identity. We apply DeepMSRF on a subset of…
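The general idea of fusing two modalities and then selecting features can be illustrated with a toy sketch. This is not the authors' implementation. The abstract does not say how the two streams are combined or which selection criterion is used, so the concatenation-based fusion, the variance-based ranking, and the function names `fuse`, `select_top_k_by_variance`, and `project` are all assumptions made for illustration. The short hand-written vectors stand in for features that the two VGGNET streams would produce.

```python
from statistics import pvariance


def fuse(audio_feats, face_feats):
    """Feature-level fusion (assumed): concatenate each sample's audio and face vectors."""
    if len(audio_feats) != len(face_feats):
        raise ValueError("both modalities must describe the same samples")
    return [list(a) + list(f) for a, f in zip(audio_feats, face_feats)]


def select_top_k_by_variance(samples, k):
    """Return the indices of the k highest-variance columns, in ascending order.

    Variance ranking is a stand-in criterion chosen for this sketch only.
    """
    n_cols = len(samples[0])
    variances = [pvariance([row[j] for row in samples]) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])


def project(samples, keep):
    """Keep only the selected columns of every sample."""
    return [[row[j] for j in keep] for row in samples]


# Toy stand-ins for per-sample features from the audio and face streams.
audio = [[0.1, 0.9], [0.2, 0.1], [0.1, 0.5]]
face = [[1.0, 0.0, 0.3], [1.0, 1.0, 0.3], [1.0, 0.5, 0.3]]

fused = fuse(audio, face)                    # 3 samples, 2 + 3 = 5 features each
keep = select_top_k_by_variance(fused, k=2)  # indices of the retained features
reduced = project(fused, keep)               # input for a downstream classifier
```

On this toy data the two constant face columns carry no information and are dropped, and the retained set mixes one audio feature with one face feature. A classifier trained on `reduced` would then predict the speaker identity.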
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
