MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers
Muhammad Bilal Shaikh, Douglas Chai, Syed Mohammed Shamsul Islam, and Naveed Akhtar

TL;DR
MAiVAR-T is a novel transformer-based model that integrates audio and video modalities for improved human action recognition, demonstrating superior performance over existing methods through extensive empirical evaluation.
Contribution
The paper introduces MAiVAR-T, a new multimodal transformer model that effectively combines audio and image representations for enhanced action recognition.
Findings
Outperforms state-of-the-art methods on benchmark datasets
Effectively integrates audio and video modalities
Shows significant improvement in action recognition accuracy
Abstract
In line with the human capacity to perceive the world by simultaneously processing and integrating high-dimensional inputs from multiple modalities like vision and audio, we propose a novel model, MAiVAR-T (Multimodal Audio-Image to Video Action Recognition Transformer). This model employs an intuitive approach for the combination of audio-image and video modalities, with a primary aim to escalate the effectiveness of multimodal human action recognition (MHAR). At the core of MAiVAR-T lies the significance of distilling substantial representations from the audio modality and transmuting these into the image domain. Subsequently, this audio-image depiction is fused with the video modality to formulate a unified representation. This concerted approach strives to exploit the contextual richness inherent in both audio and video modalities, thereby promoting action recognition. In contrast…
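The pipeline described in the abstract (audio mapped into the image domain, then fused with video) can be illustrated with a minimal sketch. The abstract excerpt does not say which audio-image representation or fusion scheme MAiVAR-T uses, so everything here is an assumption made for illustration: the log-magnitude spectrogram, the mean-pooling stand-in for a learned encoder, the random stand-in for video features, and the helper names `audio_to_image` and `fuse`.

```python
import numpy as np

def audio_to_image(waveform, frame_len=256, hop=128):
    """Turn a 1-D waveform into a 2-D log-magnitude spectrogram.

    Assumption: a spectrogram is used as the 'audio image'; the paper's
    actual representation is not specified in the excerpt above.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        waveform[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Magnitude of the real FFT of each frame: (n_frames, frame_len // 2 + 1).
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # Log compression, transposed to (freq_bins, n_frames) like an image.
    return np.log1p(magnitude).T

def fuse(audio_image_feat, video_feat):
    """Hypothetical fusion step: plain concatenation of two feature vectors."""
    return np.concatenate([audio_image_feat, video_feat], axis=-1)

rng = np.random.default_rng(0)
waveform = rng.standard_normal(16000)   # one second of noise at 16 kHz
image = audio_to_image(waveform)        # 2-D audio-image array

# Stand-ins for learned encoders (assumptions, not the paper's networks).
audio_feat = image.mean(axis=1)         # pool over time: one value per frequency bin
video_feat = rng.standard_normal(64)    # placeholder video embedding

fused = fuse(audio_feat, video_feat)
print(image.shape, fused.shape)
```

In a real system the two stand-ins would be replaced by trained encoders (the paper uses transformers), and the fused vector would feed a classifier over action labels.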
Taxonomy
Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Anomaly Detection Techniques and Applications
