GeoDeformer: Geometric Deformable Transformer for Action Recognition
Jinhui Ye, Jiaming Zhou, Hui Xiong, Junwei Liang

TL;DR
GeoDeformer enhances vision transformers for action recognition by explicitly modeling geometric deformations, leading to improved accuracy on standard datasets with minimal additional computational cost.
Contribution
The paper introduces GeoDeformer, a novel module that captures spatial and temporal geometric variations within video data, integrated into existing ViT architectures for better action recognition.
Findings
Achieves state-of-the-art accuracy on the UCF101, HMDB51, and Mini-K200 datasets.
Effectively models geometric deformations in videos.
Minimal increase in computational cost.
Abstract
Vision transformers have recently emerged as an effective alternative to convolutional networks for action recognition. However, vision transformers still struggle with geometric variations prevalent in video data. This paper proposes a novel approach, GeoDeformer, designed to capture the variations inherent in action video by integrating geometric comprehension directly into the ViT architecture. Specifically, at the core of GeoDeformer is the Geometric Deformation Predictor, a module designed to identify and quantify potential spatial and temporal geometric deformations within the given video. Spatial deformations adjust the geometry within individual frames, while temporal deformations capture the cross-frame geometric dynamics, reflecting motion and temporal progression. To demonstrate the effectiveness of our approach, we incorporate it into the established MViTv2 framework,…
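The two kinds of deformation described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the 2x3 affine parameterization, the per-frame time offsets, the linear interpolation, and the names `bilinear`, `warp_spatial`, and `warp_temporal` are all assumptions chosen for illustration, and the learned predictor that would output these parameters is omitted.

```python
# Illustrative sketch only. The affine parameterization, the per-frame time
# offsets, and every function name here are assumptions, not the paper's code.
import math


def bilinear(frame, y, x):
    """Sample a 2D grid (list of rows) at fractional (y, x); zero outside."""
    h, w = len(frame), len(frame[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                value += wy * wx * frame[yy][xx]
    return value


def warp_spatial(frame, theta):
    """Spatial deformation: warp one frame with a 2x3 affine matrix.

    theta maps each output pixel (x, y, 1) to the input location to read.
    """
    h, w = len(frame), len(frame[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            src_x = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            src_y = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            row.append(bilinear(frame, src_y, src_x))
        out.append(row)
    return out


def warp_temporal(video, offsets):
    """Temporal deformation: output frame t reads from time t + offsets[t].

    Source times are clamped to the clip and linearly interpolated
    between the two nearest frames.
    """
    n = len(video)
    h, w = len(video[0]), len(video[0][0])
    out = []
    for t, dt in enumerate(offsets):
        src = min(max(t + dt, 0.0), n - 1.0)
        t0 = int(math.floor(src))
        t1 = min(t0 + 1, n - 1)
        a = src - t0
        out.append([
            [(1 - a) * video[t0][y][x] + a * video[t1][y][x] for x in range(w)]
            for y in range(h)
        ])
    return out


frame = [[1.0, 2.0], [3.0, 4.0]]
identity = [[1, 0, 0], [0, 1, 0]]   # leaves the frame unchanged
shift_x = [[1, 0, 1], [0, 1, 0]]    # each output pixel reads one pixel to its right
clip = [[[0.0]], [[10.0]], [[20.0]]]  # three 1x1 frames

print(warp_spatial(frame, identity))
print(warp_spatial(frame, shift_x))
print(warp_temporal(clip, [0.5, 0.0, -1.0]))
```

In a trained model the affine matrix and time offsets would be predicted from the video features rather than fixed by hand, and the warps would be applied to feature maps with a differentiable sampler so the predictor can be learned end to end.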
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Neural Network Applications · Medical Imaging and Analysis
