A Framework Combining 3D CNN and Transformer for Video-Based Behavior Recognition
Xiuliang Zhang, Tadiwa Elisha Nyamasvisva, Chuntao Liu

TL;DR
This paper introduces a hybrid framework combining 3D CNN and Transformer models to improve video-based behavior recognition by capturing both local and global features more effectively.
Contribution
It proposes a novel hybrid architecture that integrates 3D CNN and Transformer modules, enhancing long-range dependency modeling in video analysis.
Findings
Outperforms traditional 3D CNN models in accuracy
Achieves higher recognition performance than standalone Transformers
Validated through ablation studies confirming module complementarity
Abstract
Video-based behavior recognition is essential in fields such as public safety, intelligent surveillance, and human-computer interaction. Traditional 3D Convolutional Neural Networks (3D CNNs) effectively capture local spatiotemporal features but struggle to model long-range dependencies. Conversely, Transformers excel at learning global contextual information but incur high computational costs. To address these limitations, we propose a hybrid framework combining 3D CNN and Transformer architectures. The 3D CNN module extracts low-level spatiotemporal features, while the Transformer module captures long-range temporal dependencies, with a fusion mechanism integrating both representations. Evaluated on benchmark datasets, the proposed model outperforms traditional 3D CNNs and standalone Transformers, achieving higher recognition accuracy with manageable complexity.…
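The abstract does not specify how the Transformer module or the fusion mechanism is built, so the following is only a minimal, dependency-free sketch of the general idea rather than the authors' implementation. It assumes the 3D CNN has already produced one feature vector per short clip segment (the hypothetical `clip_features`), applies single-head scaled dot-product self-attention with no learned projections as a stand-in for the Transformer module, and fuses by concatenating mean-pooled local and global features. The function names and the concatenation-based fusion are all illustrative assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Simplification (assumption): queries, keys, and values are the input
    tokens themselves, with no learned projection matrices.
    """
    d = len(tokens[0])
    attended = []
    for q in tokens:
        # Similarity of this token to every token in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # Weighted average of all tokens: each output sees the whole sequence.
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)]
        )
    return attended


def mean_pool(tokens):
    """Average a sequence of vectors into a single vector."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


def fuse(local_tokens):
    """Hypothetical fusion: concatenate pooled local and pooled global features."""
    global_tokens = self_attention(local_tokens)
    return mean_pool(local_tokens) + mean_pool(global_tokens)


# Hypothetical stand-in for 3D CNN output: 4 temporal segments, 2 features each.
clip_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
fused = fuse(clip_features)
print(len(fused))  # 4 (2 local dimensions + 2 global dimensions)
```

In a trained model, the fused vector would feed a classification head, and the attention step would use learned projections and multiple heads. Because attention cost grows quadratically with sequence length, running it on a short sequence of CNN features rather than on raw pixels or patches is one plausible reading of the "manageable complexity" claim.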
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Neural Network Applications · Context-Aware Activity Recognition Systems
