Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment

Theodor Wulff; Federico Tavella; Rahul Singh Maharjan; Manith Adikari; Angelo Cangelosi

arXiv:2604.05614 · cs.RO · April 8, 2026

TL;DR

This paper introduces a novel training framework for hierarchical vision-language-action models that explicitly aligns language descriptions with visual observations and actions, improving robot transparency and human-robot collaboration.

Contribution

The paper proposes a contrastive learning approach for explicit language-action alignment in hierarchical VLA models, reducing reliance on extensive annotations.

Findings

01 Achieves performance comparable to fully supervised fine-tuning.

02 Provides insights into multimodal grounding representations.

03 Establishes a strong baseline with minimal data annotations.

Abstract

Achieving robot transparency is a critical step toward effective human-robot collaboration. To be transparent, a robot's natural language communication must be consistent with its actions and explicitly grounded in the task and environment. Existing hierarchical Vision-Language-Action (VLA) models can generate language (e.g., through chain-of-thought) and low-level actions. However, current work does not consider explicit alignment between these modalities during training. To address this crucial gap, we propose a novel training framework that explicitly grounds hierarchical VLA sub-task descriptions with respect to the visual observation and action space. Our framework uses a contrastive model to assess the alignment between generated language and corresponding action trajectories. This contrastive model enables direct ranking of different language-trajectory pairs based on their…
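
The idea of scoring and ranking language-trajectory pairs with a contrastive model can be illustrated with a small sketch. This is not the authors' implementation: the embeddings are hand-written toy vectors, the helper names (`cosine`, `rank_descriptions`, `info_nce`) are hypothetical, and the InfoNCE objective and cosine scoring are common choices assumed here for illustration only, since the abstract does not specify the loss or the encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_descriptions(traj_emb, text_embs):
    """Rank candidate sub-task descriptions for one trajectory.

    Returns (order, scores), where order lists candidate indices
    from best aligned to worst aligned.
    """
    scores = [cosine(traj_emb, t) for t in text_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

def info_nce(traj_embs, text_embs, temperature=0.1):
    """InfoNCE loss (trajectory-to-text direction only).

    Assumes pair i (traj_embs[i], text_embs[i]) is the matching pair
    and every other text in the batch is a negative.
    """
    losses = []
    for i, traj in enumerate(traj_embs):
        logits = [cosine(traj, t) / temperature for t in text_embs]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy, hand-written embeddings (assumed; a real system would use learned encoders).
traj = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
order, scores = rank_descriptions(traj, candidates)

aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
misaligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the second candidate points in nearly the same direction as the trajectory embedding, so it ranks first, and the loss is lower when each trajectory is paired with its matching description than when the pairs are swapped.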
