CodeDiffuser: Attention-Enhanced Diffusion Policy via VLM-Generated Code for Instruction Ambiguity

Guang Yin; Yitong Li; Yixuan Wang; Dale McConachie; Paarth Shah; Kunimatsu Hashimoto; Huan Zhang; Katherine Liu; Yunzhu Li

arXiv:2506.16652 · cs.RO · June 23, 2025

TL;DR

This paper presents CodeDiffuser, a framework that uses vision-language models to interpret ambiguous instructions and generate executable code, improving robotic manipulation performance and interpretability.

Contribution

It introduces a novel approach combining VLM-generated code with attention mechanisms to handle instruction ambiguity in robotic tasks.

Findings

01. Outperforms existing methods on complex manipulation tasks

02. Effectively resolves language ambiguities using attention-enhanced code

03. Shows robustness to environmental variations and multi-object interactions

Abstract

Natural language instructions for robotic manipulation tasks often exhibit ambiguity and vagueness. For instance, the instruction "Hang a mug on the mug tree" may involve multiple valid actions if there are several mugs and branches to choose from. Existing language-conditioned policies typically rely on end-to-end models that jointly handle high-level semantic understanding and low-level action generation, which can result in suboptimal performance due to their lack of modularity and interpretability. To address these challenges, we introduce a novel robotic manipulation framework that can accomplish tasks specified by potentially ambiguous natural language. This framework employs a Vision-Language Model (VLM) to interpret abstract concepts in natural language instructions and generates task-specific code - an interpretable and executable intermediate representation. The generated code…
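To make the ambiguity in the mug example concrete, the sketch below enumerates the (mug, branch) pairs that would all satisfy "Hang a mug on the mug tree" in a made-up scene. Everything here is an illustrative assumption: the scene contents, the names `Scene` and `candidate_actions`, and the choice to model an action as a simple pair. It is not the paper's method or code, only a way to count interpretations.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Scene:
    """Hypothetical scene description: object labels only, no geometry."""
    mugs: tuple
    branches: tuple


def candidate_actions(scene: Scene) -> list:
    """Return every (mug, branch) pair consistent with 'hang a mug on the mug tree'."""
    return list(product(scene.mugs, scene.branches))


# Assumed example scene: two mugs and three branches.
scene = Scene(
    mugs=("red mug", "blue mug"),
    branches=("left branch", "right branch", "top branch"),
)
actions = candidate_actions(scene)
print(len(actions))  # number of valid interpretations of the same instruction
```

With two mugs and three branches the unqualified instruction admits six distinct actions, which is the kind of ambiguity the abstract describes. An instruction that names a particular mug or branch would shrink this set.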

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics