TL;DR
This paper introduces TadTR, an end-to-end Transformer-based approach to temporal action detection (TAD) in videos. It simplifies the detection pipeline, reduces computation cost, and achieves state-of-the-art results on THUMOS14 and HACS Segments.
Contribution
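The paper proposes TadTR, an end-to-end trainable Transformer-based method for temporal action detection (TAD). To adapt the Transformer to TAD, it introduces three improvements that enhance locality awareness, the core of which is a temporal deformable attention module.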
Findings
Achieves state-of-the-art performance on THUMOS14 and HACS Segments datasets.
Requires less computation than previous methods.
Employs effective boundary refinement and confidence prediction mechanisms.
Abstract
Temporal action detection (TAD) aims to determine the semantic label and the temporal interval of every action instance in an untrimmed video. It is a fundamental and challenging task in video understanding. Previous methods tackle this task with complicated pipelines. They often need to train multiple networks and involve hand-designed operations, such as non-maximal suppression and anchor generation, which limit the flexibility and prevent end-to-end learning. In this paper, we propose an end-to-end Transformer-based method for TAD, termed TadTR. Given a small set of learnable embeddings called action queries, TadTR adaptively extracts temporal context information from the video for each query and directly predicts action instances with the context. To adapt Transformer to TAD, we propose three improvements to enhance its locality awareness. The core is a temporal deformable attention…
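The abstract is cut off before it describes the temporal deformable attention module, so the following is only a minimal sketch of the general deformable attention idea applied to a 1D feature sequence, not the paper's implementation. It assumes a single head, one query, linear projections for offsets and weights, linear interpolation between frames, and clipping at the sequence boundaries. The function name and all parameter names are hypothetical.

```python
import numpy as np


def temporal_deformable_attention(features, query, ref_point,
                                  w_offset, w_weight, w_value):
    """Single-head 1D deformable attention for one query (illustrative only).

    features:  (T, C) sequence of video features
    query:     (C,)   query embedding
    ref_point: float in [0, 1], normalized reference location of the query
    w_offset:  (C, K) projection giving K sampling offsets, in frames
    w_weight:  (C, K) projection giving K attention logits
    w_value:   (C, C) value projection
    """
    num_frames = features.shape[0]
    values = features @ w_value                    # (T, C)

    # The query predicts where to look (offsets) and how much to trust
    # each sampled point (softmax-normalized weights).
    offsets = query @ w_offset                     # (K,)
    logits = query @ w_weight                      # (K,)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()

    # Sampling positions in frame coordinates; clipping is an assumption.
    pos = np.clip(ref_point * (num_frames - 1) + offsets, 0, num_frames - 1)

    # Linear interpolation between the two neighboring frames.
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, num_frames - 1)
    frac = (pos - lo)[:, None]
    sampled = (1.0 - frac) * values[lo] + frac * values[hi]   # (K, C)

    # Weighted sum over the K sampled points only, not over all T frames.
    return weights @ sampled                       # (C,)


rng = np.random.default_rng(0)
T, C, K = 100, 16, 4
out = temporal_deformable_attention(
    features=rng.normal(size=(T, C)),
    query=rng.normal(size=C),
    ref_point=0.3,
    w_offset=rng.normal(size=(C, K)),
    w_weight=rng.normal(size=(C, K)),
    w_value=rng.normal(size=(C, C)),
)
print(out.shape)
```

Because each query attends to only K sampled locations around its reference point rather than to every frame, the cost per query does not grow with the sequence length, which is one plausible reading of how attention can be made more locality-aware.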
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Deformable Attention Module · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Dropout · Multi-Head Attention · Layer Normalization
