ARDuP: Active Region Video Diffusion for Universal Policies
Shuaiyi Huang, Mara Levy, Zhenyu Jiang, Anima Anandkumar, Yuke Zhu, Linxi Fan, De-An Huang, Abhinav Shrivastava

TL;DR
ARDuP introduces a novel video diffusion framework that emphasizes active regions to improve policy learning for decision-making tasks, eliminating manual annotations and enhancing focus on interaction areas.
Contribution
The paper presents a new active region video diffusion approach that integrates active region conditioning with latent diffusion models for improved policy learning.
Findings
Significant success rate improvements in simulated tasks
Effective automatic active region discovery from motion cues
Generation of realistic and task-relevant video plans
Abstract
Sequential decision-making can be formulated as a text-conditioned video generation problem, where a video planner, guided by a text-defined goal, generates future frames visualizing planned actions, from which control actions are subsequently derived. In this work, we introduce Active Region Video Diffusion for Universal Policies (ARDuP), a novel framework for video-based policy learning that emphasizes the generation of active regions, i.e. potential interaction areas, enhancing the conditional policy's focus on interactive areas critical for task execution. This innovative framework integrates active region conditioning with latent diffusion models for video planning and employs latent representations for direct action decoding during inverse dynamic modeling. By utilizing motion cues in videos for automatic active region discovery, our method eliminates the need for manual…
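To make the idea of discovering active regions from motion cues concrete, here is a minimal sketch that uses plain frame differencing. This is an illustrative stand-in, not necessarily the procedure ARDuP itself uses; the function names `active_region_mask` and `bounding_box`, the `threshold` parameter, and the toy video are all assumptions introduced for this example.

```python
import numpy as np


def active_region_mask(frames: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """Return an (H, W) boolean mask of pixels that move during the clip.

    frames: array of shape (T, H, W), grayscale intensities in [0, 1].
    threshold: hypothetical cutoff on per-pixel intensity change.
    """
    # Absolute change between consecutive frames, shape (T - 1, H, W).
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    # Largest change seen at each pixel over the whole clip.
    motion = diffs.max(axis=0)
    return motion > threshold


def bounding_box(mask: np.ndarray):
    """Return (row_min, col_min, row_max, col_max) of the mask, or None if empty."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None
    return int(rows.min()), int(cols.min()), int(rows.max()), int(cols.max())


# Toy video: a 2x2 bright square slides one pixel to the right per frame.
frames = np.zeros((4, 8, 8), dtype=np.float32)
for t in range(4):
    frames[t, 3:5, t:t + 2] = 1.0

mask = active_region_mask(frames)
box = bounding_box(mask)
print(box)  # (3, 0, 4, 4)
```

The mask covers only the pixels swept by the moving square, so no manual annotation is needed to locate the region of interaction in this toy case. A real pipeline would likely need a more robust motion signal (for example optical flow or point tracking) to cope with camera noise and lighting changes.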
Taxonomy
Topics: Video Coding and Compression Technologies · Advanced Image Processing Techniques · Image and Signal Denoising Methods
Methods: Contrastive Language-Image Pre-training · CLIPort · Focus · Diffusion
