Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers
Youngjun Jun, Seil Kang, Woojung Han, Seong Jae Hwang

TL;DR
This paper introduces a novel method for localizing motion concepts in videos generated by Diffusion Transformers, providing interpretable saliency maps without requiring gradient updates, and demonstrating superior localization and segmentation performance.
Contribution
It proposes GramCol for spatial localization and a motion-feature selection algorithm to create Interpretable Motion-Attentive Maps, advancing interpretability in video diffusion models.
Findings
Effective motion localization in videos
Zero-shot video semantic segmentation capability
Clearer, interpretable saliency maps for motion and non-motion concepts
Abstract
Video Diffusion Transformers (DiTs) synthesize high-fidelity video from text descriptions that involve motion. However, how Video DiTs convert motion words into video remains poorly understood. Furthermore, while prior studies on interpretable saliency maps primarily target objects, motion-related behavior in Video DiTs remains largely unexplored. In this paper, we investigate concrete motion features that specify when and which object moves for a given motion concept. First, for spatial localization, we introduce GramCol, which adaptively produces per-frame saliency maps for any text concept, whether motion or non-motion. Second, we propose a motion-feature selection algorithm to obtain an Interpretable Motion-Attentive Map (IMAP) that localizes motion both spatially and temporally. Our method discovers concept saliency maps without the need…
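To make the idea of a gradient-free, per-frame concept saliency map concrete, here is a minimal sketch of a generic similarity-based map. It is not the paper's GramCol or its motion-feature selection algorithm, whose details are not given in this summary. The function names, the (T, H, W, D) feature layout, and the use of cosine similarity are all assumptions made for illustration.

```python
import numpy as np

def cosine_maps(visual_tokens, concept_vec, eps=1e-8):
    """Cosine similarity between each visual token and one concept vector.

    visual_tokens: array of shape (T, H, W, D), hypothetical per-frame features.
    concept_vec:   array of shape (D,), hypothetical text-concept embedding.
    Returns an array of shape (T, H, W) with values in [-1, 1].
    """
    v = visual_tokens / (np.linalg.norm(visual_tokens, axis=-1, keepdims=True) + eps)
    c = concept_vec / (np.linalg.norm(concept_vec) + eps)
    return v @ c  # contracts the feature dimension D

def normalize_per_frame(maps, eps=1e-8):
    """Min-max rescale each frame's map to [0, 1] for visualization."""
    lo = maps.min(axis=(1, 2), keepdims=True)
    hi = maps.max(axis=(1, 2), keepdims=True)
    return (maps - lo) / (hi - lo + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(4, 5, 6, 16))   # 4 frames on a 5x6 token grid
    concept = rng.normal(size=16)
    tokens[2, 1, 3] = concept                  # plant the concept at one location
    maps = cosine_maps(tokens, concept)
    frame, row, col = np.unravel_index(maps.argmax(), maps.shape)
    print("peak response at frame", frame, "row", row, "col", col)
```

The map needs only forward-pass features and a dot product, which matches the "no gradient updates" property described above. Any temporal selection step (deciding in which frames the concept is active) would need to work from the unnormalized similarities, because per-frame min-max scaling discards differences between frames.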
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
