Language-Grounded Decoupled Action Representation for Robotic Manipulation
Wuding Weng, Tongshu Wu, Liucheng Chen, Siyu Xie, Zheng Wang, Xing Xu, Jingkuan Song, Heng Tao Shen

TL;DR
This paper introduces LaDA, a language-grounded framework for robotic manipulation that uses interpretable action primitives and contrastive learning to improve generalization across tasks.
Contribution
LaDA is a novel decoupled action representation leveraging natural language and semantic primitives, enhancing generalization and robustness in robotic manipulation.
Findings
LaDA outperforms existing methods on simulated benchmarks.
It generalizes well to unseen and semantically related tasks.
It demonstrates effective transfer from simulation to real-world scenarios.
Abstract
The heterogeneity between high-level vision-language understanding and low-level action control remains a fundamental challenge in robotic manipulation. Although recent methods have advanced task-specific action alignment, they often struggle to generate robust and accurate actions for novel or semantically related tasks. To address this, we propose the Language-Grounded Decoupled Action Representation (LaDA) framework, which leverages natural language as a semantic bridge to connect perception and control. LaDA introduces a fine-grained intermediate layer of three interpretable action primitives--translation, rotation, and gripper control--providing explicit semantic structure for low-level actions. It further employs a semantic-guided soft-label contrastive learning objective to align similar action primitives across tasks, enhancing generalization and motion consistency. An adaptive…
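To make the two ideas in the abstract more concrete, here is a minimal sketch of (1) splitting a low-level action into the three primitives it names (translation, rotation, gripper control) and (2) a generic soft-label contrastive loss. Everything specific here is an assumption for illustration and is not taken from the paper: the 7-dimensional action layout `[dx, dy, dz, droll, dpitch, dyaw, gripper]`, the function names `decouple_action` and `soft_label_contrastive_loss`, the use of cosine similarity, and both temperature values. The loss shown is a common soft-target cross-entropy form, not necessarily the objective LaDA uses.

```python
import math


def decouple_action(action):
    """Split an action into three primitive groups.

    Assumed (hypothetical) layout: [dx, dy, dz, droll, dpitch, dyaw, gripper].
    """
    if len(action) != 7:
        raise ValueError("expected a 7-dimensional action")
    return {
        "translation": tuple(action[0:3]),
        "rotation": tuple(action[3:6]),
        "gripper": action[6],
    }


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity; a small epsilon avoids division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def soft_label_contrastive_loss(embeddings, semantic_sim,
                                temperature=0.1, label_temperature=0.1):
    """Soft-target contrastive loss (illustrative, not the paper's exact form).

    For each anchor i, the predicted distribution is a softmax over the
    embedding similarities to every other sample, and the target distribution
    is a softmax over the given semantic similarities. The loss is the mean
    cross-entropy between the two.
    """
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        logits = [cosine(embeddings[i], embeddings[j]) / temperature
                  for j in others]
        probs = softmax(logits)
        targets = softmax([semantic_sim[i][j] / label_temperature
                           for j in others])
        total += -sum(t * math.log(p) for t, p in zip(targets, probs))
    return total / n


# Usage: primitives 0 and 1 are semantically similar, primitive 2 is not.
sim = [[1.0, 1.0, 0.0],
       [1.0, 1.0, 0.0],
       [0.0, 0.0, 1.0]]
aligned = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]     # similar items embedded close
misaligned = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # similar items embedded far apart

parts = decouple_action([0.1, 0.0, 0.0, 0.0, 0.0, 0.2, 1.0])
loss_aligned = soft_label_contrastive_loss(aligned, sim)
loss_misaligned = soft_label_contrastive_loss(misaligned, sim)
```

In this toy setup the loss is lower when semantically similar primitives sit close together in embedding space, which is the kind of alignment the abstract describes. Soft targets let partially related primitives contribute in proportion to their similarity, rather than being treated as strict positives or negatives.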
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition
