HyCodePolicy: Hybrid Language Controllers for Multimodal Monitoring and Decision in Embodied Agents

Yibin Liu; Zhixuan Liang; Zanxin Chen; Tianxing Chen; Mengkang Hu; Wanxi Dong; Congsheng Xu; Zhaoming Han; Yusen Qin; Yao Mu

arXiv:2508.02629·cs.RO·August 7, 2025

TL;DR

HyCodePolicy introduces a hybrid control framework for embodied agents that combines code generation, geometric grounding, perceptual monitoring, and iterative repair to enhance robustness and efficiency in task execution.

Contribution

The paper presents a hybrid language-based control system that integrates multiple modalities and feedback mechanisms for adaptive policy repair in embodied agents.

Findings

01. Significantly improves the robustness of robot manipulation policies.

02. Enhances sample efficiency in policy learning.

03. Enables self-correcting program synthesis with minimal supervision.

Abstract

Recent advances in multimodal large language models (MLLMs) have enabled richer perceptual grounding for code policy generation in embodied agents. However, most existing systems lack effective mechanisms to adaptively monitor policy execution and repair code during task completion. In this work, we introduce HyCodePolicy, a hybrid language-based control framework that systematically integrates code synthesis, geometric grounding, perceptual monitoring, and iterative repair into a closed-loop programming cycle for embodied agents. Technically, given a natural language instruction, our system first decomposes it into subgoals and generates an initial executable program grounded in object-centric geometric primitives. The program is then executed in simulation, while a vision-language model (VLM) observes selected checkpoints to detect and localize execution failures and infer failure…
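
The closed-loop cycle described in the abstract (decompose, synthesize, execute, monitor, repair) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (`closed_loop`, `Failure`, `LoopResult`, the `toy_*` helpers, and the `max_attempts` budget) is hypothetical, the VLM monitor is replaced by a trivial stub, and the repair step is an assumption based on the "iterative repair" wording, since the abstract is cut off before it describes what happens after failure localization.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Failure:
    subgoal_index: int  # which step the monitor flagged (hypothetical field)
    reason: str         # free-text diagnosis (hypothetical field)


@dataclass
class LoopResult:
    success: bool
    program: str
    attempts: int
    failures: List[Failure]


def closed_loop(
    instruction: str,
    decompose: Callable[[str], List[str]],
    synthesize: Callable[[str, List[str]], str],
    execute: Callable[[str], List[bool]],
    monitor: Callable[[List[bool], List[str]], Optional[Failure]],
    repair: Callable[[str, Failure], str],
    max_attempts: int = 3,  # assumed budget; not taken from the paper
) -> LoopResult:
    """Run synthesize -> execute -> monitor -> repair until success or budget."""
    subgoals = decompose(instruction)
    program = synthesize(instruction, subgoals)
    failures: List[Failure] = []
    for attempt in range(1, max_attempts + 1):
        trace = execute(program)            # stands in for a simulator rollout
        failure = monitor(trace, subgoals)  # stands in for VLM checkpoint checks
        if failure is None:
            return LoopResult(True, program, attempt, failures)
        failures.append(failure)
        if attempt < max_attempts:
            # Only repair when another execution will follow.
            program = repair(program, failure)
    return LoopResult(False, program, max_attempts, failures)


# --- Toy stand-ins so the sketch runs end to end (all hypothetical) ---

def toy_decompose(instruction: str) -> List[str]:
    return ["pick block", "place block in bin"]


def toy_synthesize(instruction: str, subgoals: List[str]) -> str:
    return "pick(block)\nplace(block, bin)"


def toy_execute(program: str) -> List[bool]:
    # A step "succeeds" unless it is a place() call with no approach argument.
    return [
        not (line.startswith("place(") and "approach=" not in line)
        for line in program.splitlines()
    ]


def toy_monitor(trace: List[bool], subgoals: List[str]) -> Optional[Failure]:
    # Report the first failed step, or None if every step succeeded.
    for i, ok in enumerate(trace):
        if not ok:
            return Failure(i, "place step missed the target")
    return None


def toy_repair(program: str, failure: Failure) -> str:
    # Patch only the flagged line by adding an approach argument.
    lines = program.splitlines()
    i = failure.subgoal_index
    lines[i] = lines[i].replace(")", ", approach='top')")
    return "\n".join(lines)


result = closed_loop(
    "put the block in the bin",
    toy_decompose, toy_synthesize, toy_execute, toy_monitor, toy_repair,
)
print(result.success, result.attempts)
print(result.program)
```

Passing each stage in as a callable keeps the loop itself independent of any particular code generator, simulator, or vision-language model, so a real component could replace a stub without changing the control flow.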
