GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations

Wenxuan Guo; Ziyuan Li; Meng Zhang; Yichen Liu; Yimeng Dong; Chuxi Xu; Yunfei Wei; Ze Chen; Erjin Zhou; Jianjiang Feng

arXiv:2605.22812·cs.RO·May 22, 2026

TL;DR

GesVLA is a gesture-aware vision-language-action model that adds gesture cues as an instruction modality alongside text for robot manipulation, with the aim of resolving spatial ambiguity in complex scenes that contain multiple similar objects.

Contribution

The paper proposes a dual-VLM architecture that encodes gesture features directly into the latent space, so that they take part in both high-level reasoning and low-level action generation. It also contributes a scalable gesture data pipeline that renders hand models onto real-world scene images, and a two-stage training strategy.
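This summary does not say how gesture features enter the latent space. One common pattern, assumed here purely for illustration and not taken from the paper, is to project a gesture feature vector to the token-embedding width and prepend it to the token sequence. The function names, dimensions, and the choice of a single linear projection below are all hypothetical.

```python
import random


def linear(vec, weight, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def fuse_gesture(gesture_feat, token_embeds, weight, bias):
    """Project a gesture feature to the token width and prepend it as one token.

    This is an illustrative assumption, not the GesVLA fusion mechanism.
    """
    gesture_token = linear(gesture_feat, weight, bias)
    return [gesture_token] + token_embeds


# Hypothetical sizes: a 3-D gesture feature (for example a pointing
# direction) and 4-D token embeddings.
rng = random.Random(0)
d_gesture, d_model = 3, 4
weight = [[rng.uniform(-1, 1) for _ in range(d_gesture)] for _ in range(d_model)]
bias = [0.0] * d_model

token_embeds = [[0.1] * d_model, [0.2] * d_model]  # stand-in text/vision tokens
fused = fuse_gesture([0.0, 0.6, 0.8], token_embeds, weight, bias)
```

In this sketch the gesture becomes one more token in the sequence, so any downstream layer that attends over tokens can read it; the original tokens pass through unchanged.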

Findings

01 Gesture integration improves target grounding accuracy.

02 Model enhances human-robot interaction in cluttered environments.

03 Experimental validation on real-world robotic tasks shows significant performance gains.

Abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robot manipulation by unifying perception and action. However, existing VLA systems primarily rely on textual instructions and struggle to resolve spatial ambiguity in complex scenes with multiple similar objects. To address this limitation, we introduce gesture as a parallel instruction modality and propose a Gesture-aware Vision-Language-Action model (GesVLA). Our approach encodes gesture features directly into the latent space, enabling them to participate in both high-level reasoning and low-level action generation, and adopts a dual-VLM architecture to achieve tight coupling between gesture representations and action policies. At the data level, we construct a scalable gesture data generation pipeline by rendering hand models onto real-world scene images. This reduces the sim-to-real visual gap…
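The data-generation idea in the abstract, rendering hand models onto real-world scene images, can be pictured as alpha compositing a rendered hand patch over a photograph. The sketch below is a minimal illustration under that assumption; the function name, the tiny nested-list image format, and the placement arguments are hypothetical and do not come from the paper's pipeline.

```python
def composite(scene, hand_rgb, hand_alpha, top, left):
    """Alpha-blend a rendered hand patch onto a copy of the scene image.

    scene:      H x W list of (r, g, b) tuples, values in 0..255
    hand_rgb:   h x w list of (r, g, b) tuples from a hypothetical hand renderer
    hand_alpha: h x w list of floats in [0, 1], where 0 is fully transparent
    top, left:  where the patch's upper-left corner lands in the scene
    """
    out = [row[:] for row in scene]  # copy rows so the input scene is untouched
    for i, (rgb_row, a_row) in enumerate(zip(hand_rgb, hand_alpha)):
        for j, (fg, a) in enumerate(zip(rgb_row, a_row)):
            y, x = top + i, left + j
            # Skip patch pixels that fall outside the scene bounds.
            if 0 <= y < len(out) and 0 <= x < len(out[0]):
                bg = out[y][x]
                out[y][x] = tuple(round(a * f + (1 - a) * b)
                                  for f, b in zip(fg, bg))
    return out


# A 4x4 black scene and a 2x2 hand patch with mixed opacity.
scene = [[(0, 0, 0)] * 4 for _ in range(4)]
hand_rgb = [[(200, 150, 100)] * 2 for _ in range(2)]
hand_alpha = [[1.0, 0.5], [0.0, 1.0]]
result = composite(scene, hand_rgb, hand_alpha, top=1, left=1)
```

Because the background pixels come from a real photograph and only the hand is synthetic, a composite of this kind keeps real-world scene statistics, which is consistent with the abstract's point about reducing the sim-to-real visual gap.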

Code & Models

Repositories

https://gwxuan.github.io/GesVLA