MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers

Muhammad Bilal Shaikh; Douglas Chai; Syed Mohammed Shamsul Islam; Naveed Akhtar

arXiv:2308.03741·cs.CV·August 8, 2023

Open Access

TL;DR

MAiVAR-T is a novel transformer-based model that integrates audio and video modalities for improved human action recognition, demonstrating superior performance over existing methods through extensive empirical evaluation.

Contribution

The paper introduces MAiVAR-T, a new multimodal transformer model that effectively combines audio and image representations for enhanced action recognition.

Findings

01. Outperforms state-of-the-art methods on benchmark datasets

02. Effectively integrates audio and video modalities

03. Shows significant improvement in action recognition accuracy

Abstract

In line with the human capacity to perceive the world by simultaneously processing and integrating high-dimensional inputs from multiple modalities like vision and audio, we propose a novel model, MAiVAR-T (Multimodal Audio-Image to Video Action Recognition Transformer). This model employs an intuitive approach for the combination of audio-image and video modalities, with a primary aim to escalate the effectiveness of multimodal human action recognition (MHAR). At the core of MAiVAR-T lies the significance of distilling substantial representations from the audio modality and transmuting these into the image domain. Subsequently, this audio-image depiction is fused with the video modality to formulate a unified representation. This concerted approach strives to exploit the contextual richness inherent in both audio and video modalities, thereby promoting action recognition. In contrast…
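The pipeline the abstract describes (audio rendered as an image, then fused with video) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the magnitude spectrogram as the audio-image, mean pooling as the feature extractor, concatenation as the fusion operator, and all function names. The abstract does not specify these choices, and the sketch contains no transformer layers.

```python
import cmath
import math


def audio_to_image(samples, frame_len=8, hop=4):
    """Turn a 1-D signal into a 2-D magnitude spectrogram (rows = time frames).

    Assumption: a spectrogram stands in for the paper's audio-image.
    """
    image = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        row = []
        # Naive DFT; keep only the non-redundant half of the spectrum.
        for k in range(frame_len // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n, x in enumerate(frame)
            )
            row.append(abs(acc))
        image.append(row)
    return image


def mean_pool(grid):
    """Average a 2-D grid over its rows to get one feature vector."""
    n_rows = len(grid)
    return [sum(col) / n_rows for col in zip(*grid)]


def fuse(audio_image_feat, video_feat):
    """Fuse by concatenation (hypothetical choice, not the paper's stated operator)."""
    return list(audio_image_feat) + list(video_feat)


# Toy inputs: a sine tone and two hypothetical per-frame video feature vectors.
tone = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
video_frames = [[0.1, 0.2, 0.3], [0.3, 0.2, 0.1]]

audio_image = audio_to_image(tone)
fused = fuse(mean_pool(audio_image), mean_pool(video_frames))
```

In a real system the pooled vectors would be replaced by learned embeddings from the respective encoders; the sketch only shows the data flow from waveform to 2-D array to a single fused vector.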

Taxonomy

Topics: Human Pose and Action Recognition · Hand Gesture Recognition Systems · Anomaly Detection Techniques and Applications
