DiGIT: Multi-Dilated Gated Encoder and Central-Adjacent Region Integrated Decoder for Temporal Action Detection Transformer

Ho-Joong Kim; Yearang Lee; Jung-Ho Hong; Seong-Whan Lee

arXiv:2505.05711·cs.CV·May 12, 2025

TL;DR

DiGIT introduces a novel multi-dilated gated encoder and a central-adjacent region integrated decoder to improve temporal action detection by reducing feature redundancy and enhancing temporal context understanding, achieving state-of-the-art results.

Contribution

The paper proposes a new encoder and decoder architecture specifically designed for TAD, addressing limitations of existing query-based detectors and improving performance on benchmark datasets.

Findings

01. Achieves state-of-the-art results on THUMOS14, ActivityNet v1.3, and HACS-Segment.
02. Reduces feature redundancy while capturing fine-grained and long-range temporal information.
03. Demonstrates effectiveness of the proposed architecture through extensive experiments.

Abstract

In this paper, we examine a key limitation of query-based detectors for temporal action detection (TAD), which arises from their direct adaptation of architectures originally designed for object detection. Despite their effectiveness, existing models struggle to fully address the unique challenges of TAD, such as the redundancy in multi-scale features and the limited ability to capture sufficient temporal context. To address these issues, we propose a multi-dilated gated encoder and central-adjacent region integrated decoder for temporal action detection transformer (DiGIT). Our approach replaces the existing encoder, which consists of multi-scale deformable attention and a feedforward network, with our multi-dilated gated encoder. The proposed encoder reduces the redundant information caused by multi-level features while maintaining the ability to capture fine-grained and…
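
The idea of combining several dilation rates behind a gate can be illustrated with a toy, dependency-free sketch. This is not the authors' implementation: the abstract is truncated here and does not specify the encoder's layers, so the function names (`dilated_conv1d`, `multi_dilated_gated`, `receptive_field`), the single-channel input, the shared kernel, and the sigmoid gate on the raw input are all assumptions made purely for illustration. It only shows how parallel dilated branches over one single-scale sequence cover short and long temporal ranges without building multi-level features.

```python
import math


def dilated_conv1d(x, kernel, dilation):
    """Same-length 1D convolution with zero padding (toy, single channel)."""
    half = (len(kernel) - 1) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + (j - half) * dilation  # taps spaced `dilation` steps apart
            if 0 <= idx < len(x):
                acc += w * x[idx]
        out.append(acc)
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def multi_dilated_gated(x, kernel, dilations, gate_weights):
    """Sum of dilated branches, each scaled by a per-timestep sigmoid gate.

    Assumption: the gate for a branch is sigmoid(g * x[t]) with one scalar g
    per branch. The real encoder's gating is not described in the abstract.
    """
    branches = [dilated_conv1d(x, kernel, d) for d in dilations]
    out = []
    for t in range(len(x)):
        total = 0.0
        for branch, g in zip(branches, gate_weights):
            total += sigmoid(g * x[t]) * branch[t]
        out.append(total)
    return out


def receptive_field(kernel_size, dilation):
    """Temporal span covered by one dilated kernel."""
    return (kernel_size - 1) * dilation + 1


if __name__ == "__main__":
    impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
    print(dilated_conv1d(impulse, [1.0, 1.0, 1.0], 2))
    print(multi_dilated_gated(impulse, [1.0, 1.0, 1.0], [1, 2], [0.0, 0.0]))
    print([receptive_field(3, d) for d in (1, 2, 4)])
```

With a kernel of size 3, dilations 1, 2, and 4 span 3, 5, and 9 time steps respectively, so a small number of branches on one feature level can mix local and longer-range context.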

Code & Models

Repositories

dotori-hj/digit
PyTorch · Official

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Context-Aware Activity Recognition Systems

Methods: Softmax · Attention Is All You Need · Dense Connections · Feedforward Network