Denoise and Align: Diffusion-Driven Foreground Knowledge Prompting for Open-Vocabulary Temporal Action Detection

Sa Zhu; Wanqian Zhang; Lin Wang; Jinchao Zhang; Cong Wang; Bo Li

arXiv:2604.18313 · cs.CV · April 21, 2026

TL;DR

This paper introduces DFAlign, a diffusion-based framework that generates foreground knowledge to improve open-vocabulary temporal action detection by enhancing cross-modal alignment.

Contribution

DFAlign is presented as the first framework to use diffusion-based denoising to generate foreground knowledge, which then guides action-video alignment when detecting actions in untrimmed videos.

Findings

01 Achieves state-of-the-art results on OV-TAD benchmarks.

02 Effectively mitigates semantic noise and enhances cross-modal alignment.

03 Demonstrates the effectiveness of foreground knowledge in action detection.

Abstract

Open-Vocabulary Temporal Action Detection (OV-TAD) aims to localize and classify action segments of unseen categories in untrimmed videos, where effective alignment between action semantics and video representations is critical for accurate detection. However, existing methods struggle to mitigate the semantic imbalance between concise, abstract action labels and rich, complex video content, which inevitably introduces semantic noise and misleads cross-modal alignment. To address this challenge, we propose DFAlign, the first framework that leverages diffusion-based denoising to generate foreground knowledge that guides action-video alignment. Following a 'conditioning, denoising, and aligning' pipeline, we first introduce the Semantic-Unify Conditioning (SUC) module, which unifies action-shared and action-specific semantics as conditions for diffusion denoising. Then, the…
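
To make the idea of action-video alignment concrete, the sketch below scores one video segment against a set of action-label embeddings by cosine similarity and picks the best match. This is a generic illustration of open-vocabulary classification, not DFAlign's method: the function names (`cosine`, `classify_segment`), the three-dimensional toy vectors, and the label set are all hypothetical, and the sketch includes no diffusion or foreground-knowledge component.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify_segment(segment_feat, label_embeds):
    """Return the label whose embedding is closest to the segment feature, plus all scores."""
    scores = {label: cosine(segment_feat, emb) for label, emb in label_embeds.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy embeddings (assumed values). In practice these would come from
# pretrained text and video encoders that share an embedding space.
label_embeds = {
    "high jump": [0.9, 0.1, 0.0],
    "long jump": [0.1, 0.9, 0.0],
}
segment_feat = [0.8, 0.3, 0.1]

best, scores = classify_segment(segment_feat, label_embeds)
print(best, scores)
```

Because new labels only need a text embedding, unseen categories can be scored without retraining a classifier head. The semantic imbalance described in the abstract shows up here as a short label vector being compared against a feature that also encodes background content.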

Code & Models

Repositories

https://anonymous.4open.science/r/Code-2114
