Interpretable Motion-Attentive Maps: Spatio-Temporally Localizing Concepts in Video Diffusion Transformers

Youngjun Jun; Seil Kang; Woojung Han; Seong Jae Hwang

arXiv:2603.02919·cs.CV·March 10, 2026

Open Access

TL;DR

This paper introduces a method for spatially and temporally localizing motion concepts in videos generated by Video Diffusion Transformers. It produces interpretable saliency maps without requiring gradient updates and is reported to improve both motion localization and video semantic segmentation performance.

Contribution

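The paper proposes GramCol, which produces per-frame saliency maps that spatially localize a text concept, and a motion-feature selection algorithm that yields Interpretable Motion-Attentive Maps (IMAPs) localizing motion in both space and time, advancing interpretability in video diffusion models.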

Findings

01 Effective motion localization in videos

02 Zero-shot video semantic segmentation capability

03 Clearer, interpretable saliency maps for motion and non-motion concepts

Abstract

Video Diffusion Transformers (DiTs) synthesize high-fidelity video from text descriptions involving motion. However, how Video DiTs convert motion words into video remains poorly understood. Furthermore, while prior studies on interpretable saliency maps primarily target objects, motion-related behavior in Video DiTs remains largely unexplored. In this paper, we investigate concrete motion features that specify when and which object moves for a given motion concept. First, for spatial localization, we introduce GramCol, which adaptively produces per-frame saliency maps for any text concept, whether motion or non-motion. Second, we propose a motion-feature selection algorithm to obtain an Interpretable Motion-Attentive Map (IMAP) that localizes motion spatially and temporally. Our method discovers concept saliency maps without the need…

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis