ImitDiff: Transferring Foundation-Model Priors for Distraction Robust Visuomotor Policy
Yuhang Dong, Haizhou Ge, Yupei Zeng, Jiangning Zhang, Beiwen Tian, Hongrui Zhu, Yufei Jia, Ruixiang Wang, Zhucun Xue, Guyue Zhou, Longhua Ma, Guanzhong Tian

TL;DR
ImitDiff is a diffusion-based visuomotor policy that uses semantic masks from foundation models to improve robot manipulation robustness in complex and distracting visual environments.
Contribution
We introduce ImitDiff, a novel dual-resolution, diffusion-guided imitation learning framework leveraging vision-language priors for distraction-robust robot manipulation.
Findings
Outperforms state-of-the-art methods in complex scenes.
Shows strong zero-shot generalization to new objects and distractions.
Achieves faster inference with a new diffusion transformer action head.
Abstract
Visuomotor imitation learning policies enable robots to efficiently acquire manipulation skills from visual demonstrations. However, as scene complexity and visual distractions increase, policies that perform well in simple settings often experience substantial performance degradation. To address this challenge, we propose ImitDiff, a diffusion-based imitation learning policy guided by fine-grained semantics within a dual-resolution workflow. Leveraging pretrained priors of vision-language foundation models, our method transforms high-level instructions into pixel-level visual semantic masks. These masks guide a dual-resolution perception pipeline that captures both global context (e.g., overall layout) from low-resolution observation and fine-grained local features (e.g., geometric details) from high-resolution observation, enabling the policy to focus on task-relevant regions.…
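The mask-guided dual-resolution idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`downsample`, `mask_bounding_box`, `dual_resolution_views`), the average-pooling downsampler, the bounding-box crop, and the `factor` parameter are all assumptions made for illustration. The binary mask is assumed to come from some upstream vision-language model. The sketch only shows how a low-resolution global view and a masked high-resolution local view might be derived from one observation.

```python
import numpy as np


def downsample(image: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool an (H, W, C) image by an integer factor (assumed downsampler)."""
    h, w, c = image.shape
    h2, w2 = h // factor, w // factor
    # Trim so that height and width divide evenly by the factor.
    trimmed = image[: h2 * factor, : w2 * factor]
    return trimmed.reshape(h2, factor, w2, factor, c).mean(axis=(1, 3))


def mask_bounding_box(mask: np.ndarray) -> tuple:
    """Return (top, bottom, left, right) of the nonzero region of a 2D mask."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        # Empty mask: fall back to the full frame.
        return 0, mask.shape[0], 0, mask.shape[1]
    return int(ys.min()), int(ys.max()) + 1, int(xs.min()), int(xs.max()) + 1


def dual_resolution_views(image: np.ndarray, mask: np.ndarray, factor: int = 4):
    """Hypothetical helper: build a global low-res view and a masked high-res crop."""
    # Global context: the whole scene at reduced resolution.
    global_view = downsample(image, factor)
    # Local detail: zero out pixels outside the mask, then crop to its bounding box.
    top, bottom, left, right = mask_bounding_box(mask)
    local_view = (image * mask[..., None])[top:bottom, left:right]
    return global_view, local_view
```

In this sketch the two views would be fed to separate encoders; how ImitDiff actually fuses them is not specified in the text above.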
Taxonomy
Topics: Human Pose and Action Recognition
Methods: Diffusion
