DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion
Sauradip Nag, Xiatian Zhu, Jiankang Deng, Yi-Zhe Song, Tao Xiang

TL;DR
DiffTAD introduces a generative diffusion-based approach for temporal action detection, transforming proposal generation into a denoising process that improves accuracy and convergence over traditional discriminative methods.
Contribution
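This paper presents a diffusion-based framework for TAD that performs proposal denoising in a Transformer decoder with a temporal location query design, and adds a cross-step selective conditioning algorithm to accelerate inference. The authors report top performance on ActivityNet and THUMOS.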
Findings
Achieves top performance on ActivityNet and THUMOS datasets.
Introduces a proposal denoising diffusion process for TAD.
Demonstrates faster training convergence with the temporal location query design.
Abstract
We propose a new formulation of temporal action detection (TAD) with denoising diffusion, DiffTAD for short. Taking random temporal proposals as input, it can accurately yield action proposals for an untrimmed long video. This presents a generative modeling perspective, in contrast to previous discriminative learning approaches. This capability is achieved by first diffusing the ground-truth proposals to random ones (i.e., the forward/noising process) and then learning to reverse the noising process (i.e., the backward/denoising process). Concretely, we establish the denoising process in the Transformer decoder (e.g., DETR) by introducing a temporal location query design with faster convergence in training. We further propose a cross-step selective conditioning algorithm for inference acceleration. Extensive evaluations on ActivityNet and THUMOS show that our DiffTAD achieves top performance…
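The forward/noising process mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine noise schedule, the (center, width) proposal encoding normalized to [0, 1], and the helper names `cosine_alpha_bar` and `diffuse_proposals` are all assumptions made for illustration. It uses only the standard closed-form Gaussian diffusion step, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise.

```python
import numpy as np

def cosine_alpha_bar(num_steps: int, s: float = 0.008) -> np.ndarray:
    """Cumulative signal rate alpha_bar_t for t = 0..num_steps.

    The cosine schedule is an assumption; the summary does not state
    which schedule DiffTAD uses.
    """
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def diffuse_proposals(x0: np.ndarray, t: int, alpha_bar: np.ndarray,
                      rng: np.random.Generator) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0) for proposals x0 of shape (N, 2)."""
    noise = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise

rng = np.random.default_rng(0)
alpha_bar = cosine_alpha_bar(1000)

# Hypothetical ground-truth proposals as (center, width), normalized to [0, 1].
gt = np.array([[0.30, 0.10], [0.70, 0.20]])

slightly_noisy = diffuse_proposals(gt, 10, alpha_bar, rng)    # still close to gt
almost_random = diffuse_proposals(gt, 1000, alpha_bar, rng)   # essentially pure noise
```

In a setup like this, a denoising model would be trained to recover `gt` from samples such as `slightly_noisy` and `almost_random`, conditioned on video features. At inference it would start from purely random proposals and refine them step by step.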
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Gait Recognition and Analysis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Dropout · Dense Connections
