VLM-TDP: VLM-guided Trajectory-conditioned Diffusion Policy for Robust Long-Horizon Manipulation

Kefeng Huang; Tingguang Li; Yuzhen Liu; Zhe Zhang; Jiankun Wang; Lei Han

arXiv:2507.04524·cs.RO·July 8, 2025

TL;DR

This paper introduces VLM-TDP, a diffusion policy guided by vision-language models (VLMs). The VLM decomposes a long-horizon robotic task into manageable sub-tasks and generates a trajectory for each one; conditioning the policy on these trajectories improves success rates and robustness, especially under image noise.

Contribution

The paper presents a new VLM-guided trajectory-conditioned diffusion policy that enhances long-horizon manipulation performance and robustness against image noise, outperforming classical diffusion methods.

Findings

01  44% increase in success rate

02  Over 100% improvement in long-horizon tasks

03  20% reduction in performance degradation under noise

Abstract

Diffusion policy has demonstrated promising performance in the field of robotic manipulation. However, its effectiveness has been largely limited to short-horizon tasks, and its performance degrades significantly in the presence of image noise. To address these limitations, we propose a VLM-guided trajectory-conditioned diffusion policy (VLM-TDP) for robust and long-horizon manipulation. Specifically, the proposed method leverages state-of-the-art vision-language models (VLMs) to decompose long-horizon tasks into concise, manageable sub-tasks, while also generating voxel-based trajectories for each sub-task. The generated trajectories serve as a crucial conditioning factor, effectively steering the diffusion policy and substantially enhancing its performance. The proposed Trajectory-conditioned Diffusion Policy (TDP) is trained on trajectories derived from demonstration…
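The abstract describes voxel-based trajectories used as a conditioning signal but does not give the representation in detail. As a rough illustration only, the sketch below shows one way a sequence of continuous 3D waypoints might be discretized into voxel indices and flattened into a conditioning vector. The function names, the workspace bounds, and the grid resolution are all hypothetical and are not taken from the paper; how VLM-TDP actually encodes trajectories and feeds them to the policy may differ.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Voxel = Tuple[int, int, int]


def voxelize_trajectory(
    waypoints: Sequence[Point],
    workspace_min: Point,
    workspace_max: Point,
    grid_size: int,
) -> List[Voxel]:
    """Map continuous waypoints to voxel indices, dropping consecutive repeats.

    Hypothetical helper: the grid layout and bounds are assumptions.
    """
    voxels: List[Voxel] = []
    for point in waypoints:
        index = []
        for value, low, high in zip(point, workspace_min, workspace_max):
            # Normalise to [0, 1], scale to the grid, then clamp to valid cells.
            fraction = (value - low) / (high - low)
            cell = int(fraction * grid_size)
            index.append(min(max(cell, 0), grid_size - 1))
        voxel = (index[0], index[1], index[2])
        if not voxels or voxels[-1] != voxel:
            voxels.append(voxel)
    return voxels


def flatten_condition(voxels: Sequence[Voxel], grid_size: int) -> List[float]:
    """Flatten voxel indices into values in [0, 1] for use as a conditioning input."""
    scale = float(max(grid_size - 1, 1))  # avoid division by zero for a 1-cell grid
    return [axis / scale for voxel in voxels for axis in voxel]


if __name__ == "__main__":
    path = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (0.5, 0.5, 0.5), (1.0, 1.0, 1.0)]
    cells = voxelize_trajectory(path, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0), grid_size=10)
    print(cells)  # [(0, 0, 0), (5, 5, 5), (9, 9, 9)]
    print(len(flatten_condition(cells, grid_size=10)))  # 9
```

Dropping consecutive duplicate voxels keeps the trajectory short when several nearby waypoints fall in the same cell. A coarse grid of this kind trades spatial precision for a compact conditioning signal.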
