ARDuP: Active Region Video Diffusion for Universal Policies

Shuaiyi Huang; Mara Levy; Zhenyu Jiang; Anima Anandkumar; Yuke Zhu; Linxi Fan; De-An Huang; Abhinav Shrivastava

arXiv:2406.13301 · cs.CV · January 31, 2025

TL;DR

ARDuP introduces a novel video diffusion framework that emphasizes active regions to improve policy learning for decision-making tasks, eliminating manual annotations and enhancing focus on interaction areas.

Contribution

The paper presents a new active region video diffusion approach that integrates active region conditioning with latent diffusion models for improved policy learning.

Findings

1. Significant success rate improvements in simulated tasks
2. Effective automatic active region discovery from motion cues
3. Generation of realistic and task-relevant video plans

Abstract

Sequential decision-making can be formulated as a text-conditioned video generation problem, where a video planner, guided by a text-defined goal, generates future frames visualizing planned actions, from which control actions are subsequently derived. In this work, we introduce Active Region Video Diffusion for Universal Policies (ARDuP), a novel framework for video-based policy learning that emphasizes the generation of active regions, i.e. potential interaction areas, enhancing the conditional policy's focus on interactive areas critical for task execution. This innovative framework integrates active region conditioning with latent diffusion models for video planning and employs latent representations for direct action decoding during inverse dynamic modeling. By utilizing motion cues in videos for automatic active region discovery, our method eliminates the need for manual…
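The idea of discovering active regions from motion cues can be illustrated with a minimal sketch. It is not the paper's method: it assumes plain frame differencing as the motion cue and a single bounding box as the region, and the function name `active_region_bbox` and the `threshold` parameter are hypothetical choices made only for this example.

```python
import numpy as np


def active_region_bbox(frames, threshold=0.1):
    """Return (x_min, y_min, x_max, y_max) of pixels that change over time.

    Assumption for illustration only: motion is approximated by absolute
    differences between consecutive frames. `frames` has shape (T, H, W)
    or (T, H, W, C) with values in [0, 1]. Returns None if nothing moves.
    """
    frames = np.asarray(frames, dtype=np.float32)
    if frames.ndim == 4:
        # Collapse colour channels to a single intensity channel.
        frames = frames.mean(axis=-1)
    # Largest frame-to-frame change observed at each pixel.
    motion = np.abs(np.diff(frames, axis=0)).max(axis=0)
    mask = motion > threshold
    if not mask.any():
        return None
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


# Toy video: a 2x2 bright patch sliding one pixel to the right per frame.
video = np.zeros((3, 8, 8), dtype=np.float32)
video[0, 2:4, 1:3] = 1.0
video[1, 2:4, 2:4] = 1.0
video[2, 2:4, 3:5] = 1.0
bbox = active_region_bbox(video)
print(bbox)
```

In this toy setup the box covers the pixels the patch sweeps through, which is the kind of region a planner could be conditioned on without any manual labels. A real pipeline would likely need a more robust motion signal than raw pixel differences.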

Taxonomy

Topics: Video Coding and Compression Technologies · Advanced Image Processing Techniques · Image and Signal Denoising Methods

Methods: Contrastive Language-Image Pre-training · CLIPort · Focus · Diffusion