ImitDiff: Transferring Foundation-Model Priors for Distraction Robust Visuomotor Policy

Yuhang Dong; Haizhou Ge; Yupei Zeng; Jiangning Zhang; Beiwen Tian; Hongrui Zhu; Yufei Jia; Ruixiang Wang; Zhucun Xue; Guyue Zhou; Longhua Ma; Guanzhong Tian

arXiv:2502.09649·cs.AI·November 11, 2025

Open Access

TL;DR

ImitDiff is a diffusion-based visuomotor policy that uses semantic masks from foundation models to improve robot manipulation robustness in complex and distracting visual environments.

Contribution

We introduce ImitDiff, a dual-resolution, diffusion-based imitation learning framework that uses vision-language foundation-model priors for distraction-robust robot manipulation.

Findings

01. Outperforms state-of-the-art methods in complex scenes.

02. Shows strong zero-shot generalization to new objects and distractions.

03. Achieves faster inference with a new diffusion transformer action head.

Abstract

Visuomotor imitation learning policies enable robots to efficiently acquire manipulation skills from visual demonstrations. However, as scene complexity and visual distractions increase, policies that perform well in simple settings often experience substantial performance degradation. To address this challenge, we propose ImitDiff, a diffusion-based imitation learning policy guided by fine-grained semantics within a dual-resolution workflow. Leveraging pretrained priors of vision-language foundation models, our method transforms high-level instructions into pixel-level visual semantic masks. These masks guide a dual-resolution perception pipeline that captures both global context (e.g., overall layout) from low-resolution observation and fine-grained local features (e.g., geometric details) from high-resolution observation, enabling the policy to focus on task-relevant regions.…
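The dual-resolution idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the semantic masks are applied, so the pixel-wise masking below, the average-pool downsampling, and all function names (`downsample`, `apply_mask`, `dual_resolution_views`) are assumptions made purely for illustration. Images are plain nested lists of grayscale values, and the mask is assumed to be given rather than produced by a foundation model.

```python
# Illustrative sketch only; all names and design choices here are hypothetical.

def downsample(image, factor):
    """Average-pool a 2D image by an integer factor (low-resolution global view)."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(0, h - h % factor, factor):
        row = []
        for c in range(0, w - w % factor, factor):
            block = [image[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def apply_mask(image, mask):
    """Zero out pixels outside the mask (assumed way to focus the high-res view)."""
    return [[p if m else 0 for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


def dual_resolution_views(image, mask, factor=2):
    """Return (low-res global view, masked high-res view) as two policy inputs."""
    return downsample(image, factor), apply_mask(image, mask)


# A 4x4 toy image; the mask marks the top-right 2x2 region as task-relevant.
image = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [9, 9, 3, 3],
    [9, 9, 3, 3],
]
mask = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
low_res, masked = dual_resolution_views(image, mask)
print(low_res)   # coarse layout of the whole scene
print(masked)    # full-resolution detail, restricted to the masked region
```

In this sketch the low-resolution view keeps the overall layout at a quarter of the pixel count, while the masked high-resolution view keeps fine detail only where the mask is set, which is one simple way distractors outside the task-relevant region could be suppressed.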

Taxonomy

Topics: Human Pose and Action Recognition

Methods: Diffusion