Continuous Vision-Language-Action Co-Learning with Semantic-Physical Alignment for Behavioral Cloning

Xiuxiu Qi; Yu Yang; Jiannong Cao; Luyao Bai; Chongshan Fan; Chengtai Cao; Hongpeng Wang

arXiv:2511.14396 · cs.RO · December 24, 2025

TL;DR

This paper introduces CCoL, a continuous co-learning framework for behavioral cloning that aligns vision, language, and physical states to improve robot manipulation accuracy and robustness.

Contribution

The paper proposes a semantic-physical alignment method that uses bidirectional cross-attention to generate more accurate and smoother actions in behavioral cloning.
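To make the term "bidirectional cross-attention" concrete, here is a minimal NumPy sketch of the general technique. It is not the authors' implementation: the function names, token counts, and feature width are hypothetical, and the learned query/key/value projections that a real model would use are omitted for brevity. Semantic (language) tokens attend over physical-state tokens, and physical-state tokens attend back over semantic tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Scaled dot-product attention with no learned projections
    # (a simplification made for this sketch).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ keys_values, weights

def bidirectional_cross_attention(semantic, physical):
    # Direction 1: each semantic token gathers physical context.
    sem_ctx, sem_w = cross_attention(semantic, physical)
    # Direction 2: each physical token gathers semantic context.
    phy_ctx, phy_w = cross_attention(physical, semantic)
    # Residual connections keep the original features.
    return semantic + sem_ctx, physical + phy_ctx, sem_w, phy_w

# Hypothetical sizes: 5 language tokens, 3 physical-state tokens, width 16.
rng = np.random.default_rng(0)
semantic = rng.normal(size=(5, 16))
physical = rng.normal(size=(3, 16))
sem_out, phy_out, sem_w, phy_w = bidirectional_cross_attention(semantic, physical)
print(sem_out.shape, phy_out.shape)
```

Each row of `sem_w` and `phy_w` sums to 1, so every token's update is a weighted average of the other modality's features. This mutual exchange is what distinguishes the bidirectional form from one-way conditioning.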

Findings

01 Achieves 8.0% average improvement across simulation suites.

02 Up to 19.2% gain in bimanual insertion tasks.

03 Demonstrates successful real-world generalization on a 7-DoF robot.

Abstract

Language-conditioned manipulation facilitates human-robot interaction via behavioral cloning (BC), which learns control policies from human demonstrations and serves as a cornerstone of embodied AI. Overcoming compounding errors in sequential action decisions remains a central challenge to improving BC performance. Existing approaches mitigate compounding errors through data augmentation, expressive representation, or temporal abstraction. However, they suffer from physical discontinuities and semantic-physical misalignment, leading to inaccurate action cloning and intermittent execution. In this paper, we present Continuous vision-language-action Co-Learning with Semantic-Physical Alignment (CCoL), a novel BC framework that ensures temporally consistent execution and fine-grained semantic grounding. It generates robust and smooth action execution trajectories through continuous…

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Social Robot Interaction and HRI