A Framework Combining 3D CNN and Transformer for Video-Based Behavior Recognition

Xiuliang Zhang; Tadiwa Elisha Nyamasvisva; Chuntao Liu

arXiv:2508.06528 · cs.CV · August 12, 2025

TL;DR

This paper introduces a hybrid framework combining 3D CNN and Transformer models to improve video-based behavior recognition by capturing both local and global features more effectively.

Contribution

It proposes a novel hybrid architecture that integrates 3D CNN and Transformer modules, enhancing long-range dependency modeling in video analysis.

Findings

01 Outperforms traditional 3D CNN models in accuracy

02 Achieves higher recognition performance than standalone Transformers

03 Validated through ablation studies confirming module complementarity

Abstract

Video-based behavior recognition is essential in fields such as public safety, intelligent surveillance, and human-computer interaction. Traditional 3D Convolutional Neural Networks (3D CNNs) effectively capture local spatiotemporal features but struggle to model long-range dependencies. Conversely, Transformers excel at learning global contextual information but incur high computational costs. To address these limitations, we propose a hybrid framework combining 3D CNN and Transformer architectures. The 3D CNN module extracts low-level spatiotemporal features, while the Transformer module captures long-range temporal dependencies, with a fusion mechanism integrating both representations. Evaluated on benchmark datasets, the proposed model outperforms traditional 3D CNNs and standalone Transformers, achieving higher recognition accuracy with manageable complexity.…
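To make the pipeline in the abstract concrete, here is a minimal PyTorch sketch of a 3D CNN stem feeding a Transformer encoder, with the two representations fused before classification. It is an illustration under stated assumptions, not the authors' implementation: the class name HybridCNNTransformer, the layer sizes, the learned positional embedding, and fusion by concatenation are all hypothetical choices, since the abstract does not specify them.

```python
import torch
import torch.nn as nn


class HybridCNNTransformer(nn.Module):
    """Illustrative 3D CNN + Transformer hybrid. All sizes are assumptions."""

    def __init__(self, num_classes=10, dim=64, heads=4, layers=2, max_frames=64):
        super().__init__()
        # 3D CNN stem: extracts local spatiotemporal features.
        # Pooling is spatial only, so the temporal length is preserved.
        self.cnn = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm3d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, dim, kernel_size=3, padding=1),
            nn.BatchNorm3d(dim),
            nn.ReLU(inplace=True),
        )
        # Learned positional embedding over frames (an assumed design choice).
        self.pos = nn.Parameter(torch.zeros(1, max_frames, dim))
        # Transformer encoder: models long-range temporal dependencies.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        # Fusion by concatenation (assumed; the paper's mechanism may differ).
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, video):
        # video: (batch, 3, frames, height, width)
        feats = self.cnn(video)                           # (B, dim, T, H/2, W/2)
        tokens = feats.mean(dim=(3, 4)).transpose(1, 2)   # (B, T, dim), one token per frame
        local = tokens.mean(dim=1)                        # (B, dim) CNN-only summary
        tokens = tokens + self.pos[:, : tokens.size(1)]
        global_ctx = self.transformer(tokens).mean(dim=1)  # (B, dim) Transformer summary
        fused = torch.cat([local, global_ctx], dim=1)     # (B, 2 * dim)
        return self.classifier(fused)                     # (B, num_classes)


model = HybridCNNTransformer(num_classes=10).eval()
clip = torch.randn(2, 3, 8, 32, 32)  # two random 8-frame RGB clips at 32x32
with torch.no_grad():
    logits = model(clip)
print(logits.shape)  # (batch, num_classes)
```

Averaging each frame's feature map into a single token keeps the attention cost quadratic in the number of frames rather than in the number of spatiotemporal positions, which is one plausible way to keep complexity manageable.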

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Neural Network Applications · Context-Aware Activity Recognition Systems