DiGIT: Multi-Dilated Gated Encoder and Central-Adjacent Region Integrated Decoder for Temporal Action Detection Transformer
Ho-Joong Kim, Yearang Lee, Jung-Ho Hong, Seong-Whan Lee

TL;DR
DiGIT introduces a multi-dilated gated encoder and a central-adjacent region integrated decoder for temporal action detection. The design reduces feature redundancy, improves temporal context modeling, and achieves state-of-the-art results on THUMOS14, ActivityNet v1.3, and HACS-Segment.
Contribution
The paper proposes an encoder and decoder architecture designed specifically for TAD. It targets two limitations of existing query-based detectors, namely redundancy in multi-scale features and limited temporal context, and improves performance on benchmark datasets.
Findings
Achieves state-of-the-art results on THUMOS14, ActivityNet v1.3, and HACS-Segment.
Reduces feature redundancy while capturing fine-grained and long-range temporal information.
Demonstrates effectiveness of the proposed architecture through extensive experiments.
Abstract
In this paper, we examine a key limitation in query-based detectors for temporal action detection (TAD), which arises from their direct adaptation of architectures originally designed for object detection. Despite the effectiveness of the existing models, they struggle to fully address the unique challenges of TAD, such as the redundancy in multi-scale features and the limited ability to capture sufficient temporal context. To address these issues, we propose a multi-dilated gated encoder and central-adjacent region integrated decoder for temporal action detection transformer (DiGIT). Our approach replaces the existing encoder, which consists of multi-scale deformable attention and a feedforward network, with our multi-dilated gated encoder. Our proposed encoder reduces the redundant information caused by multi-level features while maintaining the ability to capture fine-grained and…
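The abstract names the encoder but, as excerpted here, does not give its internals. As a rough illustration of the general idea of combining several dilation rates with a gate, the following is a minimal sketch and not the authors' implementation. The function names (`dilated_conv1d`, `multi_dilated_gated`), the dilation rates (1, 2, 4), the sigmoid gate, the branch averaging, and the residual connection are all assumptions made for this example. It operates on a single-channel temporal sequence using only the Python standard library.

```python
import math


def dilated_conv1d(x, kernel, dilation):
    """Same-length 1D dilated convolution with zero padding (single channel)."""
    half = (len(kernel) - 1) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + (j - half) * dilation  # taps are spaced `dilation` steps apart
            if 0 <= idx < len(x):            # positions outside the sequence count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def multi_dilated_gated(x, kernel, gate_kernel, dilations=(1, 2, 4)):
    """Hypothetical block (an assumption, not the DiGIT encoder itself).

    For each dilation rate, a content branch is modulated element-wise by a
    sigmoid gate; the gated branches are averaged and added to the input.
    """
    out = [float(v) for v in x]  # residual path
    for d in dilations:
        content = dilated_conv1d(x, kernel, d)
        gate = dilated_conv1d(x, gate_kernel, d)
        for t in range(len(x)):
            out[t] += content[t] * sigmoid(gate[t]) / len(dilations)
    return out


# Usage: a single impulse in the middle of a 9-step sequence.
x = [0, 0, 0, 0, 1, 0, 0, 0, 0]
y = multi_dilated_gated(x, kernel=[1, 1, 1], gate_kernel=[0, 0, 0])
print(y)
```

In this toy setting, small dilations respond to nearby time steps and large dilations reach distant ones, all on a single-resolution sequence, while the gate scales how much of each branch passes through. That is one plausible reading of capturing both fine-grained and long-range information without stacking multi-scale feature levels; the paper should be consulted for the actual design.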
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Context-Aware Activity Recognition Systems
Methods: Softmax · Attention Is All You Need · Dense Connections · Feedforward Network
