MAiVAR: Multimodal Audio-Image and Video Action Recognizer
Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam, and Naveed Akhtar

TL;DR
MAiVAR introduces a CNN-based multimodal approach that combines audio-image representations with video data to enhance action recognition accuracy beyond single-modality methods.
Contribution
This paper presents MAiVAR, a novel CNN-based fusion model that integrates audio-image and video data for improved multimodal action recognition.
Findings
MAiVAR outperforms single-modality models on a large-scale dataset.
Fusion of audio-image and video modalities yields superior recognition accuracy.
The approach demonstrates effective multimodal integration for action recognition.
Abstract
Currently, action recognition is predominantly performed on video data processed by CNNs. We investigate whether the representation process of CNNs can also be leveraged for multimodal action recognition by incorporating image-based audio representations of actions in a task. To this end, we propose Multimodal Audio-Image and Video Action Recognizer (MAiVAR), a CNN-based audio-image to video fusion model that accounts for video and audio modalities to achieve superior action recognition performance. MAiVAR extracts meaningful image representations of audio and fuses them with video representations, achieving better performance than either modality alone on a large-scale action recognition dataset.
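The idea in the abstract, turning audio into an image and fusing its features with video features, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a log-magnitude spectrogram as the audio image, simple concatenation as the fusion step, and a linear softmax layer as the classifier. The function names, feature sizes, and class count are hypothetical, and random vectors stand in for the CNN embeddings.

```python
import numpy as np

def audio_to_image(waveform, frame_len=256, hop=128):
    """Convert a 1-D waveform into a 2-D log-magnitude spectrogram.

    Assumption: a spectrogram is used as the audio image here; the
    abstract does not specify which image representation MAiVAR uses.
    """
    n_frames = 1 + (len(waveform) - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        waveform[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Transpose so rows are frequency bins and columns are time frames.
    return np.log1p(magnitude).T

def fuse_and_classify(audio_feat, video_feat, weights, bias):
    """Concatenate per-modality features and apply a linear softmax layer.

    Assumption: concatenation fusion is an illustrative choice, not
    necessarily the fusion scheme used in the paper.
    """
    fused = np.concatenate([audio_feat, video_feat])
    logits = weights @ fused + bias
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

rng = np.random.default_rng(0)
waveform = rng.standard_normal(16000)      # synthetic stand-in for an audio clip
image = audio_to_image(waveform)

# Stand-ins for CNN embeddings: a real pipeline would pass the audio image
# and the video frames through trained CNN backbones.
audio_feat = image.mean(axis=1)            # average over time frames
video_feat = rng.standard_normal(64)       # hypothetical video embedding

num_classes = 5                            # hypothetical number of actions
weights = 0.01 * rng.standard_normal((num_classes, audio_feat.size + video_feat.size))
bias = np.zeros(num_classes)

probs = fuse_and_classify(audio_feat, video_feat, weights, bias)
print(image.shape, probs.shape)
```

In a trained system the classifier weights would be learned jointly on the fused features, which is what allows the combined model to exploit cues that either modality misses on its own.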
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Hand Gesture Recognition Systems
