Language-Grounded Decoupled Action Representation for Robotic Manipulation

Wuding Weng; Tongshu Wu; Liucheng Chen; Siyu Xie; Zheng Wang; Xing Xu; Jingkuan Song; Heng Tao Shen

arXiv:2603.12967 · cs.RO · March 16, 2026

Open Access

TL;DR

This paper introduces LaDA, a language-grounded framework for robotic manipulation that uses interpretable action primitives and contrastive learning to improve generalization across tasks.

Contribution

LaDA is a decoupled action representation that uses natural language as a semantic bridge and three interpretable action primitives (translation, rotation, and gripper control) to improve generalization and robustness in robotic manipulation.

Findings

01

LaDA outperforms existing methods on simulated benchmarks.

02

It generalizes well to novel or semantically related tasks.

03

It transfers effectively from simulation to real-world scenarios.

Abstract

The heterogeneity between high-level vision-language understanding and low-level action control remains a fundamental challenge in robotic manipulation. Although recent methods have advanced task-specific action alignment, they often struggle to generate robust and accurate actions for novel or semantically related tasks. To address this, we propose the Language-Grounded Decoupled Action Representation (LaDA) framework, which leverages natural language as a semantic bridge to connect perception and control. LaDA introduces a fine-grained intermediate layer of three interpretable action primitives--translation, rotation, and gripper control--providing explicit semantic structure for low-level actions. It further employs a semantic-guided soft-label contrastive learning objective to align similar action primitives across tasks, enhancing generalization and motion consistency. An adaptive…
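
The abstract does not give the exact form of the soft-label contrastive objective, so the following is only a minimal sketch of one common way such a loss can be written, not the authors' implementation. It assumes each action primitive has an embedding and a matching language-description embedding, and that soft targets come from the similarity between descriptions. The names `soft_label_contrastive_loss`, `action_embs`, `text_embs`, and `temperature` are hypothetical, as are the toy vectors.

```python
import math


def softmax(scores, temperature):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity; the small constant guards against zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)


def soft_label_contrastive_loss(action_embs, text_embs, temperature=0.1):
    """Mean cross-entropy between soft targets and predicted matches.

    Assumption: soft targets for sample i are a softmax over the similarity
    of description i to every description, so semantically close primitives
    share probability mass instead of being treated as hard negatives.
    """
    n = len(action_embs)
    total = 0.0
    for i in range(n):
        # Soft labels from language-language similarity.
        targets = softmax([cosine(text_embs[i], t) for t in text_embs], temperature)
        # Predicted distribution from action-language similarity.
        probs = softmax([cosine(action_embs[i], t) for t in text_embs], temperature)
        total -= sum(t * math.log(p) for t, p in zip(targets, probs))
    return total / n


# Toy data (hypothetical): two similar translation primitives and one gripper primitive.
text_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
aligned_actions = [list(t) for t in text_embs]
shuffled_actions = [text_embs[2], text_embs[0], text_embs[1]]

aligned_loss = soft_label_contrastive_loss(aligned_actions, text_embs)
shuffled_loss = soft_label_contrastive_loss(shuffled_actions, text_embs)
```

In this toy setup the aligned action embeddings yield a lower loss than the shuffled ones, and confusing the two similar translation primitives is penalized less than confusing either with the gripper primitive, which is the behavior a soft-label objective is meant to encourage.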

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition