Flatten: Video Action Recognition is an Image Classification task
Junlin Chen, Chengcheng Xu, Yangfan Xu, Jian Yang, Jun Li, Zhiping Shi

TL;DR
This paper introduces Flatten, a novel module that transforms 3D video data into 2D representations, enabling the use of image classification models for efficient video action recognition with improved performance.
Contribution
Flatten provides a simple, plug-and-play solution to adapt image understanding models for video recognition by converting spatiotemporal data into 2D, reducing complexity and enhancing accuracy.
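To make the idea of converting spatiotemporal data into 2D concrete, here is a minimal sketch of one possible flattening: tiling the frames of a clip into a single large image in row-major order. This summary does not specify the exact arrangement Flatten uses, so the row-major layout, the function name `flatten_video`, and the `grid_cols` parameter are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def flatten_video(video: np.ndarray, grid_cols: int) -> np.ndarray:
    """Tile a (T, H, W, C) clip into one (rows*H, grid_cols*W, C) image.

    Hypothetical helper: frames are placed left to right, top to bottom.
    This layout is an assumption made for illustration only.
    """
    t, h, w, c = video.shape
    if t % grid_cols != 0:
        raise ValueError("number of frames must be divisible by grid_cols")
    rows = t // grid_cols
    # Split the time axis into (rows, grid_cols), then interleave it
    # with the spatial axes so each frame occupies one grid cell.
    grid = video.reshape(rows, grid_cols, h, w, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, H, grid_cols, W, C)
    return grid.reshape(rows * h, grid_cols * w, c)


# Example: 8 RGB frames of 224x224 become one 448x896 image (2 rows, 4 columns).
clip = np.zeros((8, 224, 224, 3), dtype=np.float32)
image = flatten_video(clip, grid_cols=4)
print(image.shape)  # (448, 896, 3)
```

The resulting array has the shape of an ordinary image, so under this assumption it could be passed to a standard image classifier without a dedicated temporal dimension.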
Findings
Significant performance improvements on Kinetics-400, Something-Something v2, and HMDB-51 datasets.
Effective integration with models like ResNet, SwinV2, and Uniformer.
Simplifies video recognition by leveraging existing image classification architectures.
Abstract
In recent years, video action recognition, a fundamental task in the field of video understanding, has been explored in depth by numerous researchers. Most traditional video action recognition methods convert videos into three-dimensional data that encapsulates both spatial and temporal information, and then use prevalent image understanding models to model and analyze these data. However, these methods have significant drawbacks. First, image understanding models often need to be adapted in both model architecture and preprocessing to handle these spatiotemporal tasks. Second, high-dimensional data often poses greater challenges and incurs higher time costs than its lower-dimensional counterparts. To bridge the gap between image-understanding and video-understanding tasks…
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods
