Comparative Analysis of Modality Fusion Approaches for Audio-Visual Person Identification and Verification
Aref Farhadipour, Masoumeh Chapariniya, Teodora Vukovic, Volker Dellwo

TL;DR
This paper compares three multimodal fusion strategies combining voice and face data for person identification and verification, demonstrating that feature fusion of gammatonegram and facial features yields the highest accuracy.
Contribution
It introduces and evaluates three different modality fusion approaches for audio-visual person identification and verification using deep learning models.
Findings
Feature fusion of gammatonegram and facial features achieves 98.37% identification accuracy.
Concatenating facial features with x-vectors yields an equal error rate (EER) of 0.62% in verification.
Multimodal strategies outperform single-modality approaches.
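The EER quoted in the findings is the operating point at which the false acceptance rate (FAR) equals the false rejection rate (FRR). The following is a minimal sketch of how such a figure could be computed from verification scores. It assumes that higher scores mean "more likely the same person"; the function name `equal_error_rate` and the toy scores are hypothetical and are not taken from the paper.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping a threshold over all observed scores.

    genuine  : scores of same-person trials (assumed: higher = more similar)
    impostor : scores of different-person trials
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    # np.unique returns the candidate thresholds already sorted.
    thresholds = np.unique(np.concatenate([genuine, impostor]))
    # FAR: fraction of impostor trials accepted (score >= threshold).
    far = np.array([(impostor >= t).mean() for t in thresholds])
    # FRR: fraction of genuine trials rejected (score < threshold).
    frr = np.array([(genuine < t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]

# Toy scores (hypothetical): fully separated trials give an EER of zero.
eer, thr = equal_error_rate([0.90, 0.80, 0.95], [0.10, 0.20, 0.30])
print(f"EER = {eer:.2%} at threshold {thr:.2f}")
```

With a finite set of trials FAR and FRR rarely cross exactly, so the sketch averages the two rates at the closest threshold; evaluation toolkits often interpolate instead.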
Abstract
Multimodal learning integrates information from several modalities to improve learning and comprehension. We compare three modality fusion strategies for person identification and verification using two modalities: voice and face. A one-dimensional convolutional neural network extracts x-vectors from voice, while the pre-trained VGGFace2 network is adapted to the face modality through transfer learning. In addition, the gammatonegram is used as a speech representation together with the pre-trained Darknet19 network. The proposed systems are evaluated with K-fold cross-validation on the 118 speakers of the VoxCeleb2 test set. Single-modality systems and the three proposed multimodal strategies are compared under identical conditions. Results demonstrate that the feature fusion strategy of gammatonegram and…
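To make the idea of feature-level fusion and K-fold evaluation concrete, here is a minimal sketch under stated assumptions. It concatenates one voice embedding and one face embedding per sample and scores the fused vectors with a nearest-centroid classifier across K folds. The L2 normalization step, the nearest-centroid classifier, and all function names are illustrative stand-ins chosen for brevity; the summary does not specify the classifier or preprocessing used in the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so neither modality dominates by magnitude.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def fuse_features(voice_emb, face_emb):
    # Feature-level fusion: concatenate the per-sample embeddings.
    return np.concatenate([l2_normalize(voice_emb), l2_normalize(face_emb)], axis=1)

def kfold_indices(n_samples, k, seed=0):
    # Shuffle the sample indices once, then split them into k roughly equal folds.
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def nearest_centroid_accuracy(features, labels, k=5, seed=0):
    # Hypothetical stand-in classifier: assign each test sample to the
    # closest class centroid computed on the training folds.
    folds = kfold_indices(len(labels), k, seed)
    accuracies = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        x_train, y_train = features[train_idx], labels[train_idx]
        classes = np.unique(y_train)
        centroids = np.stack([x_train[y_train == c].mean(axis=0) for c in classes])
        dists = np.linalg.norm(features[test_idx][:, None, :] - centroids[None, :, :], axis=2)
        predictions = classes[np.argmin(dists, axis=1)]
        accuracies.append((predictions == labels[test_idx]).mean())
    return float(np.mean(accuracies))
```

Given arrays `voice_emb` and `face_emb` with one row per sample and an integer `labels` array, `nearest_centroid_accuracy(fuse_features(voice_emb, face_emb), labels, k=5)` returns the mean identification accuracy over the folds. Running the same function on each modality alone gives a like-for-like single-modality baseline on identical folds.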
Taxonomy
Topics: Speech and Audio Processing · Video Surveillance and Tracking Methods · Digital Media Forensic Detection
Methods: Sparse Evolutionary Training
