DISCO: Language-Guided Manipulation with Diffusion Policies and Constrained Inpainting
Ce Hao, Kelvin Lin, Zhiwei Xue, Siyuan Luo, Harold Soh

TL;DR
DISCO combines vision-language models (VLMs) with diffusion policies so that robots can execute open-vocabulary manipulation tasks specified in natural language, improving generalization and zero-shot performance.
Contribution
DISCO uses VLMs to translate natural language instructions into 3D keyframes, then guides a diffusion policy toward those keyframes through constrained inpainting, with an inpainting optimization strategy that balances keyframe adherence against the policy's learned motion priors.
Findings
Outperforms fine-tuned policies in zero-shot tasks
Demonstrates effective generalization in real-world scenarios
Balances keyframe adherence with learned motion priors
Abstract
Diffusion policies have demonstrated strong performance in generative modeling, making them promising for robotic manipulation guided by natural language instructions. However, generalizing language-conditioned diffusion policies to open-vocabulary instructions in everyday scenarios remains challenging due to the scarcity and cost of robot demonstration datasets. To address this, we propose DISCO, a framework that leverages off-the-shelf vision-language models (VLMs) to bridge natural language understanding with high-performance diffusion policies. DISCO translates linguistic task descriptions into actionable 3D keyframes using VLMs, which then guide the diffusion process through constrained inpainting. However, enforcing strict adherence to these keyframes can degrade performance when the VLM-generated keyframes are inaccurate. To mitigate this, we introduce an inpainting optimization…
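The constrained-inpainting idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the denoiser below is a hypothetical toy stand-in (temporal smoothing plus decaying noise) for a learned diffusion policy, the keyframes are hand-written rather than produced by a VLM, and names such as `toy_denoise_step` and `inpaint_trajectory` are invented for illustration. It shows only hard inpainting, where keyframe waypoints are re-imposed after every denoising step; the inpainting optimization that relaxes this constraint is not reproduced here.

```python
import numpy as np


def toy_denoise_step(x, rng, noise_scale):
    """Hypothetical stand-in for a learned denoiser (assumption, not DISCO's model).

    Smooths the trajectory along the time axis and adds a small amount of noise.
    """
    padded = np.pad(x, ((1, 1), (0, 0)), mode="edge")
    smoothed = (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0
    return smoothed + noise_scale * rng.standard_normal(x.shape)


def inpaint_trajectory(keyframes, horizon=16, steps=50, seed=0):
    """Generate a (horizon, 3) trajectory that passes through the given keyframes.

    keyframes: dict mapping a timestep index to a 3D position.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, 3))  # start from pure noise

    # Build the inpainting mask and the values to impose at masked timesteps.
    mask = np.zeros(horizon, dtype=bool)
    target = np.zeros((horizon, 3))
    for t, position in keyframes.items():
        mask[t] = True
        target[t] = position

    for k in range(steps):
        # Noise level decays linearly and reaches zero on the final step.
        noise_scale = 0.1 * (1.0 - (k + 1) / steps)
        x = toy_denoise_step(x, rng, noise_scale)
        # Hard constraint: overwrite keyframe timesteps after every step.
        x[mask] = target[mask]
    return x


# Usage: two hand-written keyframes at the start and end of the horizon.
keyframes = {0: [0.0, 0.0, 0.0], 15: [1.0, 0.5, 0.2]}
trajectory = inpaint_trajectory(keyframes)
```

Because the keyframes are overwritten on every step, the returned trajectory matches them exactly, which is the strict adherence that the abstract notes can hurt performance when the keyframes themselves are inaccurate.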
Taxonomy
Topics: Natural Language Processing Techniques · Robot Manipulation and Learning · Hate Speech and Cyberbullying Detection
