KineVLA: Towards Kinematics-Aware Vision-Language-Action Models with Bi-Level Action Decomposition

Gaoge Han; Zhengqing Gao; Ziwen Li; Jiaxin Huang; Shaoli Huang; Fakhri Karray; Mingming Gong; Tongliang Liu

arXiv:2603.17524 · cs.RO · March 19, 2026

Open Access

TL;DR

KineVLA introduces a framework for fine-grained, kinematics-aware vision-language-action tasks, enabling robots to interpret and execute complex, instruction-level kinematic commands with improved precision and adaptability.

Contribution

The paper proposes a novel bi-level action representation and reasoning approach for kinematics-rich VLA tasks, along with new datasets and benchmarks for evaluation.

Findings

01 KineVLA outperforms existing baselines on kinematics-sensitive benchmarks.

02 The framework achieves more precise and controllable robotic manipulation.

03 Experiments validate the effectiveness of bi-level reasoning tokens in aligning language and action.

Abstract

In this paper, we introduce a novel kinematics-rich vision-language-action (VLA) task in which language commands densely encode diverse kinematic attributes (such as direction, trajectory, orientation, and relative displacement) at key moments from initiation through completion. Unlike existing action instructions, which capture kinematics only coarsely or partially, these commands support fine-grained and personalized manipulation. In this setting, task goals remain invariant while execution trajectories must adapt to instruction-level kinematic specifications. To address this challenge, we propose KineVLA, a vision-language-action framework that explicitly decouples goal-level invariance from kinematics-level variability through a bi-level action representation, with bi-level reasoning tokens serving as explicit, supervised intermediate variables that align language and action. To…
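
The split between an invariant goal and variable kinematics described in the abstract can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions: the class names (`GoalSpec`, `KinematicSpec`, `BiLevelInstruction`), the helper `same_goal`, and the example instructions are all hypothetical and do not come from the paper. Only the four attribute names are taken from the abstract's list of kinematic attributes. It does not reproduce KineVLA's action representation or its reasoning tokens.

```python
from dataclasses import dataclass, field

# Attribute names taken from the abstract's list; everything else is hypothetical.
KINEMATIC_KEYS = ("direction", "trajectory", "orientation", "relative_displacement")


@dataclass(frozen=True)
class GoalSpec:
    """Goal-level part (hypothetical): what to achieve, shared across instructions."""
    verb: str
    target: str


@dataclass
class KinematicSpec:
    """Kinematics-level part (hypothetical): how to execute, varies per instruction."""
    attributes: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject attribute names outside the list given in the abstract.
        unknown = set(self.attributes) - set(KINEMATIC_KEYS)
        if unknown:
            raise ValueError(f"unknown kinematic attributes: {sorted(unknown)}")


@dataclass
class BiLevelInstruction:
    """One instruction split into an invariant goal and variable kinematics."""
    goal: GoalSpec
    kinematics: KinematicSpec


def same_goal(a: BiLevelInstruction, b: BiLevelInstruction) -> bool:
    """True when two instructions share a goal, whatever their kinematics."""
    return a.goal == b.goal


# Two made-up instructions: same goal, different execution requirements.
a = BiLevelInstruction(
    GoalSpec("place", "cup"),
    KinematicSpec({"direction": "from the left", "trajectory": "arc"}),
)
b = BiLevelInstruction(
    GoalSpec("place", "cup"),
    KinematicSpec({"direction": "from above", "orientation": "handle facing right"}),
)
```

Here `same_goal(a, b)` holds while `a.kinematics` and `b.kinematics` differ, which is the situation the abstract describes: the task goal stays fixed and only the execution-level specification changes.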

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics