End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames

Shuming Liu; Chen-Lin Zhang; Chen Zhao; Bernard Ghanem

arXiv:2311.17241 · cs.CV · April 23, 2024 · 1 citation

TL;DR

This paper presents an end-to-end temporal action detection model that scales to a 1-billion-parameter backbone and 1,536 input frames, using a novel lightweight adapter that reduces training memory and improves detection performance.

Contribution

The paper proposes the temporal-informative adapter (TIA), which enables large-scale end-to-end training of temporal action detection (TAD) models by reducing memory requirements and improving temporal context aggregation.
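
As a rough illustration of what an adapter with temporal aggregation could look like, the sketch below down-projects each frame's feature vector, mixes every channel with its neighbors in time using a depthwise convolution of width 3, up-projects, and adds the result back to the input through a scale that starts at zero. This is a minimal sketch under stated assumptions, not the authors' TIA: the class name `TemporalAdapterSketch`, the one-vector-per-frame input, the kernel width, the GELU activation, and the zero-initialized scale are all hypothetical choices made for illustration.

```python
import math
import random


def gelu(x):
    """Exact GELU activation for a single float."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Apply y = W x + b, where weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def temporal_depthwise_conv(frames, kernels):
    """Width-3 depthwise convolution over time with zero padding.

    frames: list of T feature vectors, each of length C.
    kernels: list of C kernels, each with 3 taps (previous, current, next).
    """
    num_frames, channels = len(frames), len(frames[0])
    out = []
    for t in range(num_frames):
        row = []
        for c in range(channels):
            prev_val = frames[t - 1][c] if t > 0 else 0.0
            next_val = frames[t + 1][c] if t < num_frames - 1 else 0.0
            k = kernels[c]
            row.append(k[0] * prev_val + k[1] * frames[t][c] + k[2] * next_val)
        out.append(row)
    return out


class TemporalAdapterSketch:
    """Hypothetical adapter: down-project, mix adjacent frames, up-project."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

        self.w_down, self.b_down = init(bottleneck, dim), [0.0] * bottleneck
        self.kernels = init(bottleneck, 3)
        self.w_up, self.b_up = init(dim, bottleneck), [0.0] * dim
        # Assumption: a zero-initialized scale makes the adapter an identity
        # map at the start of training, leaving backbone features untouched.
        self.scale = 0.0

    def forward(self, frames):
        hidden = [[gelu(v) for v in linear(f, self.w_down, self.b_down)] for f in frames]
        mixed = temporal_depthwise_conv(hidden, self.kernels)
        update = [linear(h, self.w_up, self.b_up) for h in mixed]
        # Residual connection: input plus the scaled adapter output.
        return [[x + self.scale * u for x, u in zip(f, upd)]
                for f, upd in zip(frames, update)]


if __name__ == "__main__":
    adapter = TemporalAdapterSketch(dim=8, bottleneck=2)
    frames = [[float(t + c) for c in range(8)] for t in range(5)]
    adapter.scale = 1.0
    output = adapter.forward(frames)
    print(len(output), len(output[0]))
```

With a width-3 kernel, each output frame depends only on itself and its two immediate neighbors, so stacking such adapters across backbone layers is one way the temporal receptive field could grow with depth.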

Findings

01 Achieved 75.4% mAP on the THUMOS14 dataset.

02 First end-to-end model to outperform the best feature-based methods, using a VideoMAEv2-giant backbone.

03 Successfully scaled the TAD backbone up to 1 billion parameters.

Abstract

Recently, temporal action detection (TAD) has seen significant performance improvement with end-to-end training. However, due to the memory bottleneck, only models with limited scales and limited data volumes can afford end-to-end training, which inevitably restricts TAD performance. In this paper, we reduce the memory consumption for end-to-end training, and manage to scale up the TAD backbone to 1 billion parameters and the input video to 1,536 frames, leading to significant detection performance. The key to our approach lies in our proposed temporal-informative adapter (TIA), which is a novel lightweight module that reduces training memory. Using TIA, we free the humongous backbone from learning to adapt to the TAD task by only updating the parameters in TIA. TIA also leads to better TAD representation by temporally aggregating context from adjacent frames throughout the backbone. We…
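
The idea of updating only adapter parameters can be made concrete with some rough arithmetic. The sketch below counts the parameters of one simplified adapter (down-projection, depthwise temporal convolution, up-projection) and compares the total across all layers with a 1-billion-parameter backbone. Only the "1 billion" figure comes from the abstract; the hidden width of 1408, the depth of 40 layers, the reduction ratio of 4, and the kernel width of 3 are assumptions chosen for illustration, and the adapter layout is a simplification rather than the paper's exact design.

```python
def adapter_params(dim, reduction, kernel_size):
    """Parameter count of one simplified adapter (weights plus biases)."""
    bottleneck = dim // reduction
    down = dim * bottleneck + bottleneck       # down-projection
    conv = kernel_size * bottleneck + bottleneck  # depthwise temporal conv
    up = bottleneck * dim + dim                # up-projection
    return down + conv + up


# "1 billion" is the rounded backbone size stated in the abstract.
BACKBONE_PARAMS = 1_000_000_000
# Assumed ViT-giant-style dimensions; not taken from this page.
HIDDEN_DIM = 1408
NUM_LAYERS = 40

per_layer = adapter_params(HIDDEN_DIM, reduction=4, kernel_size=3)
trainable = per_layer * NUM_LAYERS
fraction = trainable / BACKBONE_PARAMS

print(f"adapter parameters per layer: {per_layer:,}")
print(f"trainable adapter parameters: {trainable:,}")
print(f"share of backbone size: {fraction:.2%}")
```

Under these assumptions the trainable set is a small percentage of the backbone, which shrinks the memory needed for gradients and optimizer state. Activation memory is a separate cost that this count does not model.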

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications

Methods: Adapter