HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Tianshuo Yang; Guanyu Chen; Yutian Chen; Zhixuan Liang; Yitian Liu; Zanxin Chen; Chunpu Xu; Haotian Liang; Jiangmiao Pang; Yao Mu; Ping Luo

arXiv:2604.14125·cs.CV·May 12, 2026

TL;DR

HiVLA is a hierarchical framework that separates high-level planning from low-level control, so that the base VLM keeps its reasoning ability while fine-grained task execution improves.

Contribution

The paper presents HiVLA, a novel decoupled architecture combining a VLM-based planner with a flow-matching Diffusion Transformer for robust manipulation.

Findings

01

HiVLA outperforms state-of-the-art end-to-end models in simulation and real-world tasks.

02

It excels in long-horizon skill composition and manipulating small objects in cluttered scenes.

03

The architecture maintains zero-shot reasoning while allowing independent component improvements.

Abstract

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in the low-level part, equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution…
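
To make the planner-to-controller handoff concrete, the following is a minimal sketch of the two ideas the abstract names: a structured plan (subtask instruction plus target bounding box) and flow-matching sampling of an action. It is not the authors' implementation. The names `SubtaskPlan`, `crop_to_bbox`, `euler_flow_sample`, and `toy_velocity` are hypothetical, the `(x_min, y_min, x_max, y_max)` pixel convention is an assumption, and the closed-form velocity field stands in for a trained DiT action expert, which the sketch does not model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class SubtaskPlan:
    """Hypothetical container for one step of the planner's structured output."""
    instruction: str                 # subtask instruction in natural language
    bbox: Tuple[int, int, int, int]  # assumed (x_min, y_min, x_max, y_max) in pixels


def crop_to_bbox(image: Sequence[Sequence[int]],
                 bbox: Tuple[int, int, int, int]) -> List[List[int]]:
    """Crop a row-major image to the box, clamping the box to the image bounds."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = bbox
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [list(row[x0:x1]) for row in image[y0:y1]]


def euler_flow_sample(velocity: Callable[[List[float], float], List[float]],
                      x_start: Sequence[float],
                      steps: int = 10) -> List[float]:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x_start)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target: Sequence[float]) -> Callable[[List[float], float], List[float]]:
    """Stand-in for a learned velocity network: points straight at a known target.

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity at
    (x_t, t) is (x_1 - x_t) / (1 - t). A real action expert would predict this
    from observations and the plan instead of being given the target.
    """
    def velocity(x: List[float], t: float) -> List[float]:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Usage: a plan selects an image region, and sampling moves noise toward an action.
plan = SubtaskPlan(instruction="pick up the red block", bbox=(1, 1, 3, 3))
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patch = crop_to_bbox(image, plan.bbox)
action = euler_flow_sample(toy_velocity([0.5, -0.2, 0.1]), [0.0, 0.0, 0.0], steps=10)
```

The crop illustrates why a bounding box is a useful interface: the low-level module can attend to a small, high-resolution region instead of the whole scene. The sampler shows only the generic flow-matching inference loop, with sampling starting from a fixed point rather than random noise to keep the example deterministic.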
