A Survey on Backbones for Deep Video Action Recognition
Zixuan Tang, Youjun Zhao, Yuhang Wen, Mengyuan Liu

TL;DR
This survey reviews diverse deep learning backbones for video action recognition, including two-stream, 3D CNN, and transformer-based methods, highlighting their architectures, challenges, and future directions.
Contribution
It provides a comprehensive overview of current deep neural network backbones for action recognition, comparing their approaches and identifying research gaps.
Findings
Two-stream networks utilize RGB and optical flow modalities.
3D CNNs directly extract motion features from RGB videos.
Transformer-based models introduce NLP techniques into video understanding.
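To make the two-stream finding concrete, here is a minimal sketch of how predictions from an RGB stream and an optical flow stream might be combined. It assumes late fusion by weighted averaging of softmax scores, which is one common choice and is not stated in this summary. The function names, the `flow_weight` parameter, and the logit values are all hypothetical, used only for illustration.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(rgb_logits, flow_logits, flow_weight=0.5):
    """Weighted average of per-class probabilities from the two streams.

    Assumption: each stream is a separate network that has already produced
    one score per action class for the same video clip.
    """
    rgb_probs = softmax(rgb_logits)
    flow_probs = softmax(flow_logits)
    return [
        (1.0 - flow_weight) * r + flow_weight * f
        for r, f in zip(rgb_probs, flow_probs)
    ]

# Made-up scores for three action classes.
rgb_logits = [2.0, 1.0, 0.1]    # appearance stream (RGB frames)
flow_logits = [0.5, 2.5, 0.2]   # motion stream (optical flow)

fused = late_fusion(rgb_logits, flow_logits)
predicted = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case the RGB stream alone favors class 0, while the flow stream's stronger evidence for class 1 wins after fusion. A 3D CNN, by contrast, would take the stacked RGB frames as a single input and would need no separate flow stream.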
Abstract
Action recognition is a key technology for building interactive metaverses. With the rapid development of deep learning, action recognition methods have advanced considerably. Researchers design and implement backbones from multiple standpoints, which has diversified the methods and raised new challenges. This paper reviews several action recognition methods based on deep neural networks. We introduce these methods in three parts: 1) two-stream networks and their variants, which, specifically in this paper, take RGB video frames and the optical flow modality as input; 2) 3D convolutional networks, which exploit the RGB modality directly, so that separately extracting motion information is no longer necessary; 3) Transformer-based methods, which introduce the model from natural language processing into computer vision and video…
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Video Surveillance and Tracking Methods
