LangGap: Diagnosing and Closing the Language Gap in Vision-Language-Action Models

Yuchen Hou; Lin Zhao

arXiv:2603.00592·cs.RO·March 3, 2026

Open Access

TL;DR

This paper introduces LangGap, a benchmark that diagnoses and improves language understanding in vision-language-action (VLA) models through semantic perturbations and diversified tasks, and shows that current state-of-the-art models largely ignore language instructions.

Contribution

The paper presents LangGap, a novel benchmark with semantic perturbations and diverse tasks to systematically evaluate and diagnose language understanding deficits in VLA models.

Findings

01. Targeted data augmentation improves success rate from 0% to 90%.

02. Multi-task training increases success rate from 0% to 28%.

03. Models struggle with increased semantic diversity, revealing fundamental limitations.

Abstract

Vision-Language-Action (VLA) models achieve over 95% success on standard benchmarks. However, through systematic experiments, we find that current state-of-the-art VLA models largely ignore language instructions. Prior work lacks: (1) systematic semantic perturbation diagnostics, (2) a benchmark that forces language understanding by design, and (3) linguistically diverse training data. This paper constructs the LangGap benchmark, based on a four-dimensional semantic perturbation method -- varying instruction semantics while keeping the tabletop layout fixed -- revealing language understanding deficits in π0.5. Existing benchmarks like LIBERO assign only one task per layout, underutilizing available objects and target locations; LangGap fully diversifies pick-and-place tasks under identical layouts, forcing models to truly understand language. Experiments show that targeted data…
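The idea of diversifying pick-and-place tasks under one fixed layout can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `enumerate_tasks`, the instruction template, and the object and target names are all hypothetical, chosen only to show how a single layout yields many instructions that differ in language alone.

```python
from itertools import product


def enumerate_tasks(objects, targets):
    """Return one pick-and-place instruction per (object, target) pair.

    The scene layout is assumed fixed; only the instruction text varies,
    so a policy that ignores language cannot solve every task.
    """
    return [
        f"pick up the {obj} and place it on the {tgt}"
        for obj, tgt in product(objects, targets)
    ]


# Hypothetical layout contents (illustrative names, not from the paper).
objects = ["red mug", "butter", "soup can"]
targets = ["plate", "tray"]

tasks = enumerate_tasks(objects, targets)
for instruction in tasks:
    print(instruction)
```

With three objects and two targets, the same layout produces six distinct instructions, whereas a one-task-per-layout benchmark would use only one of them.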

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Neural Network Applications