TL;DR
This paper introduces VTN, a transformer-based framework for video recognition that processes entire videos efficiently, achieving faster training and inference while maintaining competitive accuracy, and serving as a new baseline for future research.
Contribution
The paper proposes a generic transformer-based approach for video recognition that replaces 3D ConvNets, enabling faster training and inference with competitive accuracy.
Findings
In wall-clock runtime, trains 16.1 times faster and runs 5.1 times faster during inference than other state-of-the-art methods.
Requires 1.5 times fewer GFLOPs while analyzing the whole video in a single end-to-end pass.
Achieves competitive results on Kinetics-400.
Abstract
This paper presents VTN, a transformer-based framework for video recognition. Inspired by recent developments in vision transformers, we ditch the standard approach in video action recognition that relies on 3D ConvNets and introduce a method that classifies actions by attending to the entire video sequence information. Our approach is generic and builds on top of any given 2D spatial network. In terms of wall runtime, it trains 16.1 times faster and runs 5.1 times faster during inference while maintaining competitive accuracy compared to other state-of-the-art methods. It enables whole video analysis, via a single end-to-end pass, while requiring 1.5 times fewer GFLOPs. We report competitive results on Kinetics-400 and present an ablation study of VTN properties and the trade-off between accuracy and inference speed. We hope our approach will serve as a new baseline and start a…
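The pipeline the abstract describes (a 2D spatial network applied per frame, followed by attention over the whole frame sequence and a classification step) can be illustrated with a toy sketch. This is not the authors' implementation. The abstract does not say which attention module is used, so the sketch assumes plain single-head self-attention with identity projections. The function names (`spatial_backbone`, `self_attention`, `classify_video`), the average-pooling "backbone", the zero-initialized classification token, and the hand-set class weights are all hypothetical stand-ins for learned components.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_backbone(frame, dim):
    """Hypothetical stand-in for any 2D spatial network.

    Average-pools a flat list of pixel values into `dim` features.
    Assumes len(frame) is a multiple of `dim`.
    """
    chunk = len(frame) // dim
    return [sum(frame[i * chunk:(i + 1) * chunk]) / chunk for i in range(dim)]


def self_attention(tokens):
    """Single-head scaled dot-product attention.

    Assumption: identity projections, so queries, keys, and values are
    the tokens themselves. A real model would learn these projections.
    """
    dim = len(tokens[0])
    attended = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[j] for w, value in zip(weights, tokens))
             for j in range(dim)]
        )
    return attended


def classify_video(frames, class_weights, dim=4):
    """Classify a whole video in one pass over all of its frames."""
    # 1. Per-frame 2D features (no 3D convolutions involved).
    features = [spatial_backbone(frame, dim) for frame in frames]
    # 2. Prepend a classification token (learned in a real model, zeros here).
    tokens = [[0.0] * dim] + features
    # 3. Attend over the entire frame sequence at once.
    attended = self_attention(tokens)
    # 4. Linear head on the classification token's output.
    video_repr = attended[0]
    logits = [sum(w * x for w, x in zip(row, video_repr))
              for row in class_weights]
    return softmax(logits)


if __name__ == "__main__":
    # Five toy "frames" of 16 pixels each, three hypothetical classes.
    frames = [[float(i + t) for i in range(16)] for t in range(5)]
    class_weights = [
        [0.1, 0.2, 0.3, 0.4],
        [0.4, 0.3, 0.2, 0.1],
        [0.0, 0.0, 0.0, 0.0],
    ]
    print(classify_video(frames, class_weights))
```

Because the classification token is all zeros and the projections are identities, its attention weights come out uniform, so the video representation here is simply the mean of all tokens. A trained model would learn the token and the projections, letting it weight frames unevenly.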
