TL;DR
This paper introduces DFAlign, a diffusion-based framework that generates foreground knowledge to improve open-vocabulary temporal action detection by enhancing cross-modal alignment.
Contribution
DFAlign is the first framework to use diffusion-based denoising to generate foreground knowledge, which guides action-video alignment and improves open-vocabulary action detection in untrimmed videos.
Findings
Achieves state-of-the-art results on OV-TAD benchmarks.
Effectively mitigates semantic noise and enhances cross-modal alignment.
Demonstrates the effectiveness of foreground knowledge in action detection.
Abstract
Open-Vocabulary Temporal Action Detection (OV-TAD) aims to localize and classify action segments of unseen categories in untrimmed videos, where effective alignment between action semantics and video representations is critical for accurate detection. However, existing methods struggle to mitigate the semantic imbalance between concise, abstract action labels and rich, complex video content, which inevitably introduces semantic noise and misleads cross-modal alignment. To address this challenge, we propose DFAlign, the first framework that leverages diffusion-based denoising to generate foreground knowledge that guides action-video alignment. Following a 'conditioning, denoising, and aligning' pipeline, we first introduce the Semantic-Unify Conditioning (SUC) module, which unifies action-shared and action-specific semantics as conditions for diffusion denoising. Then, the…
