End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames
Shuming Liu, Chen-Lin Zhang, Chen Zhao, Bernard Ghanem

TL;DR
This paper scales end-to-end temporal action detection (TAD) to a 1-billion-parameter backbone and 1,536 input frames, using a novel lightweight adapter that reduces training memory and improves detection performance.
Contribution
The paper proposes the temporal-informative adapter (TIA), enabling large-scale end-to-end training of TAD models with reduced memory requirements and enhanced temporal context aggregation.
Findings
Achieves 75.4% mAP on the THUMOS14 dataset.
First end-to-end model to outperform the best feature-based methods, trained end-to-end on VideoMAEv2-giant.
Scales the TAD backbone up to 1 billion parameters.
Abstract
Recently, temporal action detection (TAD) has seen significant performance improvement with end-to-end training. However, due to the memory bottleneck, only models with limited scales and limited data volumes can afford end-to-end training, which inevitably restricts TAD performance. In this paper, we reduce the memory consumption for end-to-end training, and manage to scale up the TAD backbone to 1 billion parameters and the input video to 1,536 frames, leading to significant detection performance. The key to our approach lies in our proposed temporal-informative adapter (TIA), which is a novel lightweight module that reduces training memory. Using TIA, we free the humongous backbone from learning to adapt to the TAD task by only updating the parameters in TIA. TIA also leads to better TAD representation by temporally aggregating context from adjacent frames throughout the backbone. We…
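The adapter idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `TemporalAdapterSketch`, the bottleneck layout (down-projection, GELU, depthwise temporal convolution, up-projection, residual), the kernel size of 3, and the zero-initialized up-projection are all assumptions chosen for illustration, written in plain Python so it runs without any dependencies. In this setup the backbone weights would stay frozen and only the adapter's few parameters would be updated.

```python
import math
import random

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def linear(x, weight):
    """Bias-free linear map per frame: x is (T, C_in), weight is (C_in, C_out)."""
    n_in, n_out = len(weight), len(weight[0])
    return [[sum(f[i] * weight[i][j] for i in range(n_in)) for j in range(n_out)] for f in x]

def temporal_depthwise_conv(x, kernels):
    """Mix each channel with its neighbours in time, zero-padded at both ends.
    x is (T, C); kernels is (C, K) with K odd, one kernel per channel."""
    T, C = len(x), len(x[0])
    half = len(kernels[0]) // 2
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            for k, w in enumerate(kernels[c]):
                s = t + k - half  # source frame index
                if 0 <= s < T:
                    out[t][c] += w * x[s][c]
    return out

class TemporalAdapterSketch:
    """Hypothetical adapter: only these small matrices would be trained."""

    def __init__(self, dim, bottleneck, kernel_size=3, seed=0):
        rng = random.Random(seed)
        self.down = [[rng.gauss(0, 0.02) for _ in range(bottleneck)] for _ in range(dim)]
        self.kernels = [[rng.gauss(0, 0.02) for _ in range(kernel_size)] for _ in range(bottleneck)]
        # Assumption: zero init, so the adapter starts as an identity mapping.
        self.up = [[0.0] * dim for _ in range(bottleneck)]

    def __call__(self, x):
        h = linear(x, self.down)                      # (T, bottleneck)
        h = [[gelu(v) for v in frame] for frame in h]
        h = temporal_depthwise_conv(h, self.kernels)  # aggregate adjacent frames
        h = linear(h, self.up)                        # (T, dim)
        return [[a + b for a, b in zip(fx, fh)] for fx, fh in zip(x, h)]  # residual

    def num_params(self):
        return sum(len(row) for m in (self.down, self.kernels, self.up) for row in m)

# Usage: 5 frames of 8-dimensional features from a (frozen) backbone layer.
adapter = TemporalAdapterSketch(dim=8, bottleneck=2)
features = [[float(t + c) for c in range(8)] for t in range(5)]
adapted = adapter(features)
```

With a bottleneck much smaller than the feature dimension, the adapter holds far fewer parameters than the layer it sits beside, which is the general reason adapter-style tuning can reduce training memory. The temporal convolution is what lets each frame's output depend on its neighbours.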
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications
Methods: Adapter
