CaP-X: A Framework for Benchmarking and Improving Coding Agents for Robot Manipulation

Max Fu; Justin Yu; Karim El-Refai; Ethan Kou; Haoru Xue; Huang Huang; Wenli Xiao; Guanzhi Wang; Fei-Fei Li; Guanya Shi; Jiajun Wu; Shankar Sastry; Yuke Zhu; Ken Goldberg; Linxi "Jim" Fan

arXiv:2603.22435·cs.RO·March 25, 2026

TL;DR

CaP-X introduces a comprehensive framework for benchmarking and enhancing coding agents in robot manipulation, demonstrating how structured code, scaling, and reinforcement learning improve robustness and transferability in embodied tasks.

Contribution

The paper presents CaP-X, an open-access platform for studying and improving Code-as-Policy agents, including new environments, benchmarks, and methods for robustness and sim2real transfer.

Findings

01

Performance improves with human-crafted abstractions

02

Scaling agentic computation enhances robustness

03

Reinforcement learning with verifiable rewards boosts success rates

Abstract

"Code-as-Policy" considers how executable code can complement data-intensive Vision-Language-Action (VLA) methods, yet the effectiveness of coding agents as autonomous controllers for embodied manipulation remains underexplored. We present CaP-X, an open-access framework for systematically studying Code-as-Policy agents in robot manipulation. At its core is CaP-Gym, an interactive environment in which agents control robots by synthesizing and executing programs that compose perception and control primitives. Building on this foundation, CaP-Bench evaluates frontier language and vision-language models across varying levels of abstraction, interaction, and perceptual grounding. Across 12 models, CaP-Bench reveals a consistent trend: performance improves with human-crafted abstractions but degrades as these priors are removed, exposing a dependence on designer scaffolding. At the same time, we observe…
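
Illustrative sketch

To make "programs that compose perception and control primitives" concrete, here is a minimal toy sketch of what such a program, together with a programmatically checkable success condition, might look like. Everything in it is an assumption for illustration: `ToyWorld`, `detect`, `pick`, `place`, `policy`, and `success` are hypothetical names and do not come from CaP-Gym or the paper.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Position = Tuple[float, float]


@dataclass
class ToyWorld:
    """Hypothetical stand-in for a simulated scene: object name -> (x, y)."""
    objects: Dict[str, Position]
    held: Optional[str] = None


def detect(world: ToyWorld, name: str) -> Position:
    """Hypothetical perception primitive: look up an object's position."""
    return world.objects[name]


def pick(world: ToyWorld, name: str) -> None:
    """Hypothetical control primitive: grasp a named object."""
    if world.held is not None:
        raise RuntimeError("gripper is already holding an object")
    world.held = name


def place(world: ToyWorld, xy: Position) -> None:
    """Hypothetical control primitive: release the held object at xy."""
    if world.held is None:
        raise RuntimeError("nothing to place")
    world.objects[world.held] = xy
    world.held = None


def policy(world: ToyWorld) -> None:
    """The kind of short program an agent might synthesize for
    'put the cube on the plate': perceive, then act."""
    target = detect(world, "plate")
    pick(world, "cube")
    place(world, target)


def success(world: ToyWorld) -> bool:
    """Programmatic success check (an assumed example of a verifiable signal)."""
    return world.held is None and world.objects["cube"] == world.objects["plate"]


world = ToyWorld(objects={"cube": (0.1, 0.2), "plate": (0.5, 0.5)})
policy(world)
print(success(world))  # prints True
```

In this toy setup the primitives are high-level and human-written. Taking them away would leave the synthesized program to do more of the perception and control work itself, which is the direction in which the abstract reports performance degrading.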

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics