From CNNs to Transformers in Multimodal Human Action Recognition: A Survey
Muhammad Bilal Shaikh, Syed Mohammed Shamsul Islam, Douglas Chai, and Naveed Akhtar

TL;DR
This survey reviews the transition from CNNs to Transformers in multimodal human action recognition, emphasizing fusion techniques, recent design trends, and future research directions in the field.
Contribution
Unlike broad surveys of human action recognition, it focuses specifically on fusion design and recent architectural choices in multimodal human action recognition (MHAR).
Findings
Analysis of classic and emerging fusion techniques
Identification of recent efficient MHAR model designs
Discussion of multimodal datasets and evaluation methods
Abstract
Due to its widespread applications, human action recognition is one of the most widely studied research problems in Computer Vision. Recent studies have shown that addressing it using multimodal data leads to superior performance as compared to relying on a single data modality. During the adoption of deep learning for visual modelling in the last decade, action recognition approaches have mainly relied on Convolutional Neural Networks (CNNs). However, the recent rise of Transformers in visual modelling is now also causing a paradigm shift for the action recognition task. This survey captures this transition while focusing on Multimodal Human Action Recognition (MHAR). Unique to the induction of multimodal computational models is the process of "fusing" the features of the individual data modalities. Hence, we specifically focus on the fusion design aspects of the MHAR approaches. We…
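As a rough illustration of what "fusing" modalities can mean, the sketch below contrasts two generic strategies: early fusion, which concatenates per-modality feature vectors before classification, and late fusion, which averages per-modality class scores. This is not a method taken from the survey; the function names, the RGB and skeleton modalities, and all numeric values are hypothetical and chosen only for the example.

```python
from typing import List, Optional, Sequence


def early_fusion(features: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-modality feature vectors into one joint vector."""
    fused: List[float] = []
    for modality_features in features:
        fused.extend(modality_features)
    return fused


def late_fusion(
    scores: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> List[float]:
    """Return the weighted average of per-modality class scores."""
    if not scores:
        raise ValueError("at least one modality is required")
    if weights is None:
        weights = [1.0] * len(scores)  # assumption: equal trust in each modality
    if len(weights) != len(scores):
        raise ValueError("one weight per modality is required")
    num_classes = len(scores[0])
    if any(len(s) != num_classes for s in scores):
        raise ValueError("all modalities must score the same classes")
    total = sum(weights)
    return [
        sum(w * s[c] for w, s in zip(weights, scores)) / total
        for c in range(num_classes)
    ]


# Hypothetical inputs: feature vectors and class scores from two modalities.
rgb_features = [0.1, 0.7, 0.2]
skeleton_features = [0.9, 0.3]
joint_features = early_fusion([rgb_features, skeleton_features])

rgb_scores = [0.8, 0.2]
skeleton_scores = [0.4, 0.6]
fused_scores = late_fusion([rgb_scores, skeleton_scores])
predicted_class = max(range(len(fused_scores)), key=fused_scores.__getitem__)
```

Early fusion lets a single downstream classifier model interactions between modalities, while late fusion keeps the per-modality models independent and combines only their decisions.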
Taxonomy
Methods: Softmax · Focus · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer
