End-to-end Temporal Action Detection with Transformer

Xiaolong Liu; Qimeng Wang; Yao Hu; Xu Tang; Shiwei Zhang; Song Bai; Xiang Bai

arXiv:2106.10271 · cs.CV · August 12, 2022

TL;DR

This paper introduces TadTR, an end-to-end Transformer-based approach to temporal action detection in videos that simplifies the detection pipeline, lowers computation cost, and achieves state-of-the-art results on THUMOS14 and HACS Segments.

Contribution

The paper proposes TadTR, an end-to-end trainable Transformer-based method for TAD. To adapt the Transformer to this task, it adds three improvements that enhance locality awareness, the core of which is a temporal deformable attention module.
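To make the idea of a deformable attention module more concrete, here is a minimal single-head sketch in NumPy of deformable attention reduced to one (temporal) dimension: each query samples the feature sequence at a few learned offsets around a reference point instead of attending to every frame. This is an illustration of the general technique, not the paper's module. The function names, the `tanh` offset scaling of 0.1, and the projection matrices `W_off` and `W_attn` are all assumptions made for the example.

```python
import numpy as np

def linear_sample(features, pos):
    """Linearly interpolate features (T, D) at normalized positions pos in [0, 1]."""
    T = features.shape[0]
    x = np.clip(pos, 0.0, 1.0) * (T - 1)      # map to the frame index range
    lo = np.floor(x).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    w = (x - lo)[..., None]                   # weight of the right neighbour
    return (1.0 - w) * features[lo] + w * features[hi]

def temporal_deformable_attention(queries, ref_points, features, W_off, W_attn):
    """Single-head sketch: each query attends to K sampled points only.

    queries:    (N, D) query embeddings
    ref_points: (N,)   normalized reference locations in [0, 1]
    features:   (T, D) video feature sequence
    W_off:      (D, K) projection producing K sampling offsets per query
    W_attn:     (D, K) projection producing K attention logits per query
    """
    offsets = np.tanh(queries @ W_off) * 0.1            # hypothetical offset scaling
    logits = queries @ W_attn
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)       # softmax over the K points
    sample_pos = ref_points[:, None] + offsets          # (N, K)
    sampled = linear_sample(features, sample_pos)       # (N, K, D)
    out = (weights[..., None] * sampled).sum(axis=1)    # (N, D)
    return out, sample_pos, weights

rng = np.random.default_rng(0)
N, T, D, K = 4, 32, 8, 3
out, pos, w = temporal_deformable_attention(
    rng.normal(size=(N, D)), rng.uniform(size=N), rng.normal(size=(T, D)),
    rng.normal(size=(D, K)), rng.normal(size=(D, K)))
print(out.shape)  # (4, 8)
```

Because each query touches only K sampled positions, the cost per query does not grow with the sequence length T, which is one plausible reason such a module suits long untrimmed videos.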

Findings

01

Achieves state-of-the-art performance on THUMOS14 and HACS Segments datasets.

02

Requires lower computation cost than previous methods.

03

Provides effective mechanisms for boundary refinement and confidence prediction.

Abstract

Temporal action detection (TAD) aims to determine the semantic label and the temporal interval of every action instance in an untrimmed video. It is a fundamental and challenging task in video understanding. Previous methods tackle this task with complicated pipelines. They often need to train multiple networks and involve hand-designed operations, such as non-maximal suppression and anchor generation, which limit the flexibility and prevent end-to-end learning. In this paper, we propose an end-to-end Transformer-based method for TAD, termed TadTR. Given a small set of learnable embeddings called action queries, TadTR adaptively extracts temporal context information from the video for each query and directly predicts action instances with the context. To adapt Transformer to TAD, we propose three improvements to enhance its locality awareness. The core is a temporal deformable attention…
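For readers unfamiliar with the hand-designed operations the abstract mentions, the following is a minimal pure-Python sketch of greedy non-maximum suppression over temporal segments, the kind of post-processing step an end-to-end detector aims to avoid. It is a generic illustration, not code from the paper or its repository. The function names, the `(start, end)` segment format, and the 0.5 threshold are assumptions chosen for the example.

```python
def temporal_iou(a, b):
    """Temporal IoU of two segments given as (start, end) in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(segments, scores, iou_threshold=0.5):
    """Greedy NMS: visit segments by descending score and keep a segment
    only if it does not overlap an already kept one above the threshold."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(temporal_iou(segments[i], segments[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate detections of one action plus one separate action.
segments = [(1.0, 5.0), (1.5, 5.5), (10.0, 12.0)]
scores = [0.9, 0.8, 0.7]
print(temporal_nms(segments, scores))  # [0, 2]
```

The threshold is a manually tuned hyperparameter and the suppression step is not differentiable, which illustrates why the abstract describes such operations as an obstacle to end-to-end learning.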

Code & Models

Repositories

xlliu7/TadTR (PyTorch, official)

Taxonomy

Methods: Attention Is All You Need · Linear Layer · Deformable Attention Module · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Dropout · Multi-Head Attention · Layer Normalization