Flatten: Video Action Recognition is an Image Classification task

Junlin Chen; Chengcheng Xu; Yangfan Xu; Jian Yang; Jun Li; Zhiping Shi

arXiv:2408.09220 · cs.CV · August 20, 2024 · 3 cites

TL;DR

This paper introduces Flatten, a novel module that transforms 3D video data into 2D representations, enabling the use of image classification models for efficient video action recognition with improved performance.

Contribution

Flatten provides a simple, plug-and-play solution to adapt image understanding models for video recognition by converting spatiotemporal data into 2D, reducing complexity and enhancing accuracy.
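To make the idea of converting spatiotemporal data into 2D concrete, here is a minimal sketch, assuming the frames of a clip are simply tiled in row-major order onto one large image. The summary above does not say how Flatten arranges frames, so the tiling rule, the function name `flatten_video`, and the `grid_rows`/`grid_cols` parameters are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def flatten_video(frames: np.ndarray, grid_rows: int, grid_cols: int) -> np.ndarray:
    """Tile T frames of shape (H, W, C) into a single 2D image.

    Hypothetical helper: assumes a row-major layout, which is only one
    possible way to map a clip onto an image.
    """
    t, h, w, c = frames.shape
    if t != grid_rows * grid_cols:
        raise ValueError(f"expected {grid_rows * grid_cols} frames, got {t}")
    # Split the time axis into grid coordinates: (rows, cols, H, W, C).
    grid = frames.reshape(grid_rows, grid_cols, h, w, c)
    # Interleave grid and pixel axes: (rows, H, cols, W, C).
    grid = grid.transpose(0, 2, 1, 3, 4)
    # Merge into one image of shape (rows * H, cols * W, C).
    return grid.reshape(grid_rows * h, grid_cols * w, c)


# Example: a 16-frame RGB clip at 56x56 becomes one 224x224 RGB image.
clip = np.zeros((16, 56, 56, 3), dtype=np.float32)
image = flatten_video(clip, grid_rows=4, grid_cols=4)
print(image.shape)
```

Under this assumption the output has the same layout as an ordinary image batch element, so it could be passed to an unmodified image classifier; whether that matches the paper's actual transformation is not stated here.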

Findings

01. Significant performance improvements on Kinetics-400, Something-Something v2, and HMDB-51 datasets.

02. Effective integration with models like ResNet, SwinV2, and Uniformer.

03. Simplifies video recognition by leveraging existing image classification architectures.

Abstract

In recent years, video action recognition, as a fundamental task in the field of video understanding, has been deeply explored by numerous researchers. Most traditional video action recognition methods typically involve converting videos into three-dimensional data that encapsulates both spatial and temporal information, subsequently leveraging prevalent image understanding models to model and analyze these data. However, these methods have significant drawbacks. Firstly, when delving into video action recognition tasks, image understanding models often need to be adapted accordingly in terms of model architecture and preprocessing for these spatiotemporal tasks. Secondly, dealing with high-dimensional data often poses greater challenges and incurs higher time costs compared to its lower-dimensional counterparts. To bridge the gap between image-understanding and video-understanding tasks…

Taxonomy

Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods