MAiVAR: Multimodal Audio-Image and Video Action Recognizer

Muhammad Bilal Shaikh; Douglas Chai; Syed Mohammed Shamsul Islam; Naveed Akhtar

arXiv:2209.04780 · cs.CV · January 20, 2023

Open Access

TL;DR

MAiVAR introduces a CNN-based multimodal approach that combines audio-image representations with video data to enhance action recognition accuracy beyond single-modality methods.

Contribution

This paper presents MAiVAR, a novel CNN-based fusion model that integrates audio-image and video data for improved multimodal action recognition.

Findings

01. MAiVAR outperforms single-modality models on a large-scale action recognition dataset.

02. Fusing audio-image and video representations yields higher recognition accuracy than either modality alone.

03. CNN-based feature extraction, typically applied to video, also works on image representations of audio, which enables multimodal action recognition.

Abstract

Currently, action recognition is predominantly performed on video data as processed by CNNs. We investigate whether the representation process of CNNs can also be leveraged for multimodal action recognition by incorporating image-based audio representations of actions in a task. To this end, we propose the Multimodal Audio-Image and Video Action Recognizer (MAiVAR), a CNN-based audio-image to video fusion model that accounts for video and audio modalities to achieve superior action recognition performance. MAiVAR extracts meaningful image representations of audio and fuses them with video representations to achieve better performance than either modality individually on a large-scale action recognition dataset.
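The two steps named in the abstract, turning audio into an image and fusing it with video features, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not say which audio-image representation or fusion scheme MAiVAR uses, so the sketch uses a plain magnitude spectrogram and simple feature concatenation. The names `spectrogram`, `fuse`, `frame_len`, `hop`, and the placeholder video feature vector are hypothetical, and no CNN is involved.

```python
import cmath
import math


def spectrogram(samples, frame_len=64, hop=32):
    """Return a 2D magnitude spectrogram (time frames x frequency bins).

    Uses a naive DFT per frame; this is an illustrative audio-to-image
    step, not the representation confirmed by the paper.
    """
    n_bins = frame_len // 2 + 1
    # Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    image = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(n_bins):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            row.append(abs(acc))
        image.append(row)
    return image


def fuse(audio_features, video_features):
    """Feature-level fusion by concatenation (an assumed scheme)."""
    return list(audio_features) + list(video_features)


# Hypothetical input: a 125 Hz tone sampled at 1000 Hz.
sample_rate = 1000
tone = [math.sin(2 * math.pi * 125 * t / sample_rate) for t in range(256)]
image = spectrogram(tone)

# Stand-in for CNN features: average each frequency bin over time.
audio_features = [sum(row[k] for row in image) / len(image)
                  for k in range(len(image[0]))]

# Placeholder video feature vector; a real system would take this from a video CNN.
video_features = [0.1, 0.4, 0.2, 0.3]

fused = fuse(audio_features, video_features)
print(len(image), len(image[0]), len(fused))
```

In a real pipeline the spectrogram would be rendered as an image and passed through a CNN, and the fused vector would feed a classifier; the per-bin averaging above only stands in for that feature extractor.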

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Hand Gesture Recognition Systems