From CNNs to Transformers in Multimodal Human Action Recognition: A Survey

Muhammad Bilal Shaikh; Syed Mohammed Shamsul Islam; Douglas Chai; Naveed Akhtar

arXiv:2405.15813 · cs.CV · May 28, 2024

TL;DR

This survey reviews the transition from CNNs to Transformers in multimodal human action recognition, emphasizing fusion techniques, recent design trends, and future research directions in the field.

Contribution

It uniquely focuses on fusion design and recent architectural choices in MHAR, providing insights beyond broad human action recognition surveys.

Findings

01. Analysis of classic and emerging fusion techniques

02. Identification of recent efficient MHAR model designs

03. Discussion of multimodal datasets and evaluation methods

Abstract

Due to its widespread applications, human action recognition is one of the most widely studied research problems in Computer Vision. Recent studies have shown that addressing it using multimodal data leads to superior performance as compared to relying on a single data modality. During the adoption of deep learning for visual modelling in the last decade, action recognition approaches have mainly relied on Convolutional Neural Networks (CNNs). However, the recent rise of Transformers in visual modelling is now also causing a paradigm shift for the action recognition task. This survey captures this transition while focusing on Multimodal Human Action Recognition (MHAR). Unique to the induction of multimodal computational models is the process of "fusing" the features of the individual data modalities. Hence, we specifically focus on the fusion design aspects of the MHAR approaches. We…
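To make the abstract's notion of "fusing" per-modality features concrete, here is a minimal sketch of two classic strategies: feature-level (early) fusion by concatenation and score-level (late) fusion by averaging class probabilities. It is an illustration only, not the survey's own taxonomy or any reviewed model; the function names, the RGB and skeleton modalities, and all numbers are assumptions chosen for the example.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def early_fusion(rgb_feat, skeleton_feat):
    """Feature-level fusion: concatenate per-modality feature vectors.

    Hypothetical helper: the joint vector would be fed to a single classifier.
    """
    return list(rgb_feat) + list(skeleton_feat)


def late_fusion(rgb_scores, skeleton_scores, rgb_weight=0.5):
    """Score-level fusion: weighted average of per-modality class probabilities.

    Hypothetical helper: each modality has its own classifier, and only
    their output distributions are combined.
    """
    p_rgb = softmax(rgb_scores)
    p_skel = softmax(skeleton_scores)
    return [rgb_weight * a + (1.0 - rgb_weight) * b
            for a, b in zip(p_rgb, p_skel)]


# Toy inputs (made-up numbers, not taken from the survey).
rgb_feat = [0.2, 0.7, 0.1]
skeleton_feat = [0.9, 0.4]
fused_feat = early_fusion(rgb_feat, skeleton_feat)  # one joint feature vector

rgb_scores = [2.0, 1.0, 0.1]       # scores for three hypothetical action classes
skeleton_scores = [0.5, 2.5, 0.3]
fused_probs = late_fusion(rgb_scores, skeleton_scores)

# Index of the most probable class after late fusion.
predicted_class = max(range(len(fused_probs)), key=fused_probs.__getitem__)
```

In this toy case the two modalities disagree on the top class, and the averaged distribution decides between them. Early fusion instead defers that decision to whatever classifier consumes the concatenated vector.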

Taxonomy

Methods: Softmax · Focus · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer