Comparative Analysis of Modality Fusion Approaches for Audio-Visual Person Identification and Verification

Aref Farhadipour; Masoumeh Chapariniya; Teodora Vukovic; Volker Dellwo

arXiv:2409.00562 · eess.AS · November 5, 2024 · 2 cites

Open Access

TL;DR

This paper compares three multimodal fusion strategies combining voice and face data for person identification and verification, demonstrating that feature fusion of gammatonegram and facial features yields the highest accuracy.

Contribution

The paper introduces and evaluates three modality fusion approaches for audio-visual person identification and verification using deep learning models.

Findings

01

Feature fusion of gammatonegram and facial features achieves 98.37% accuracy.

02

Concatenating facial features with x-vectors results in 0.62% EER.

03

Multimodal strategies outperform single-modality approaches.
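The verification finding above is reported as an equal error rate (EER), the operating point where the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how an EER can be computed from trial scores. The function name `equal_error_rate` and the toy scores are hypothetical and are not taken from the paper, whose exact scoring pipeline is not described on this page.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Approximate the EER from same-person (genuine) and
    different-person (impostor) similarity scores.
    Higher scores are assumed to mean 'more likely the same person'."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)

    # Use every observed score as a candidate decision threshold.
    thresholds = np.sort(np.concatenate([genuine, impostor]))

    # FAR: fraction of impostor trials accepted (score >= threshold).
    far = np.array([(impostor >= t).mean() for t in thresholds])
    # FRR: fraction of genuine trials rejected (score < threshold).
    frr = np.array([(genuine < t).mean() for t in thresholds])

    # The EER lies where the two error curves cross; take the closest point.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Toy scores (illustrative only, not data from the paper).
toy_genuine = [0.9, 0.8, 0.3, 0.7]
toy_impostor = [0.1, 0.2, 0.75, 0.4]
toy_eer = equal_error_rate(toy_genuine, toy_impostor)
```

With a finite number of trials the two error curves rarely cross exactly, so this sketch averages FAR and FRR at the closest threshold. Evaluation toolkits often interpolate instead.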

Abstract

Multimodal learning integrates information from multiple modalities to enhance learning and comprehension. We compare three modality fusion strategies for person identification and verification using two modalities: voice and face. A one-dimensional convolutional neural network is employed for x-vector extraction from voice, while the pre-trained VGGFace2 network with transfer learning is used for the face modality. In addition, the gammatonegram is used as a speech representation in conjunction with the pre-trained Darknet19 network. The proposed systems are evaluated using K-fold cross-validation on the 118 speakers of the VoxCeleb2 test set. Single-modality systems and the three proposed multimodal strategies are compared under identical conditions. Results demonstrate that the feature fusion strategy of gammatonegram and…
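As an illustration of feature-level fusion by concatenation, the sketch below joins a voice embedding and a face embedding into a single vector and scores two fused vectors with cosine similarity. The embedding dimensions, the L2 normalization step, and the cosine scoring are assumptions made for this example. The page does not state which dimensions, normalization, or classifier the authors use, and the random vectors merely stand in for real network outputs.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale a vector to unit length (assumed preprocessing, not from the paper)."""
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def fuse_features(voice_emb, face_emb):
    """Feature-level fusion: normalize each modality, then concatenate.
    Normalizing first keeps one modality from dominating purely by scale."""
    return np.concatenate([l2_normalize(voice_emb), l2_normalize(face_emb)], axis=-1)

def cosine_score(a, b):
    """Cosine similarity between two fused vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return float(np.sum(a * b))

# Placeholder embeddings; the sizes 512 and 2048 are hypothetical.
rng = np.random.default_rng(0)
voice = rng.normal(size=512)   # stands in for a voice embedding such as an x-vector
face = rng.normal(size=2048)   # stands in for a face-network embedding

fused = fuse_features(voice, face)
self_score = cosine_score(fused, fused)
```

The fused vector can then be passed to any downstream classifier for identification or compared against an enrolled vector for verification.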

Taxonomy

Topics: Speech and Audio Processing · Video Surveillance and Tracking Methods · Digital Media Forensic Detection

Methods: Sparse Evolutionary Training