PGT: A Progressive Method for Training Models on Long Videos
Bo Pang, Gao Peng, Yizhuo Li, Cewu Lu

TL;DR
The paper introduces PGT, a progressive training method that trains on long videos end-to-end by propagating information sequentially along the temporal dimension, rather than splitting each video into independent clips to fit within limited computational resources.
Contribution
It proposes a progressive training approach, inspired by NLP techniques for handling long sentences, that treats a video as a series of fragments satisfying the Markov property and trains on the whole video end-to-end with limited resources.
Findings
Improves the SlowOnly network by 3.7 mAP on Charades
Improves top-1 accuracy by 1.9 on Kinetics
Achieves these gains with negligible parameter and computation overhead
Abstract
Convolutional video models have an order of magnitude larger computational complexity than their counterpart image-level models. Constrained by computational resources, there is no model or training method that can train long video sequences end-to-end. Currently, the mainstream method is to split a raw video into clips, leading to incomplete, fragmentary temporal information flow. Inspired by natural language processing techniques dealing with long sentences, we propose to treat videos as serial fragments satisfying the Markov property, and train them as a whole by progressively propagating information through the temporal dimension in multiple steps. This progressive training (PGT) method is able to train long videos end-to-end with limited resources and ensures the effective transmission of information. As a general and robust training method, we empirically demonstrate that it yields…
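The idea of treating a video as serial fragments and propagating information through time in multiple steps can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say what is propagated or how gradients are handled, so the names split_into_fragments, progressive_pass, and running_mean_step, as well as the toy "state" carried between fragments, are assumptions made purely for illustration. The only property the sketch is meant to show is the Markov-style dependence: each step sees the current fragment plus a state summarizing everything before it.

```python
from typing import Callable, List, Sequence, Tuple


def split_into_fragments(frames: Sequence[float], fragment_len: int) -> List[Sequence[float]]:
    """Cut a long frame sequence into consecutive, non-overlapping fragments."""
    if fragment_len <= 0:
        raise ValueError("fragment_len must be positive")
    return [frames[i:i + fragment_len] for i in range(0, len(frames), fragment_len)]


def progressive_pass(
    frames: Sequence[float],
    fragment_len: int,
    step: Callable[[Sequence[float], Tuple[int, float]], Tuple[float, Tuple[int, float]]],
    init_state: Tuple[int, float],
) -> Tuple[List[float], Tuple[int, float]]:
    """Process fragments in temporal order, carrying a state forward.

    Each step depends only on the current fragment and the state handed over
    by the previous step, so only one fragment is processed at a time.
    """
    state = init_state
    outputs = []
    for fragment in split_into_fragments(frames, fragment_len):
        out, state = step(fragment, state)
        outputs.append(out)
    return outputs, state


def running_mean_step(fragment, state):
    """Hypothetical stand-in for a model step: state is (frame count, running sum)."""
    count, total = state
    count += len(fragment)
    total += sum(fragment)
    return total / count, (count, total)


if __name__ == "__main__":
    # Ten scalar "frames" processed four at a time.
    outputs, final_state = progressive_pass(list(range(10)), 4, running_mean_step, (0, 0.0))
    print(outputs, final_state)
```

In a real video model the step would be a network forward and backward pass on one fragment, and the carried state would be whatever features the method chooses to hand to the next fragment. The toy running mean only shows that later outputs reflect earlier fragments without holding the whole video in memory at once.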
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Advanced Vision and Imaging
