TL;DR
GesVLA introduces a gesture-aware vision-language-action model that enhances robot manipulation by integrating gesture cues, improving spatial understanding and interaction efficiency in complex scenes.
Contribution
The paper proposes a novel dual-VLM architecture that encodes gesture features into the latent space for better reasoning and action generation, along with a scalable gesture data pipeline and a two-stage training strategy.
Findings
Gesture integration improves target grounding accuracy.
The model enhances human-robot interaction in cluttered environments.
Experiments on real-world robotic tasks show significant performance gains.
Abstract
Vision-Language-Action (VLA) models have shown strong potential for general-purpose robot manipulation by unifying perception and action. However, existing VLA systems primarily rely on textual instructions and struggle to resolve spatial ambiguity in complex scenes with multiple similar objects. To address this limitation, we introduce gesture as a parallel instruction modality and propose a Gesture-aware Vision-Language-Action model (GesVLA). Our approach encodes gesture features directly into the latent space, enabling them to participate in both high-level reasoning and low-level action generation, and adopts a dual-VLM architecture to achieve tight coupling between gesture representations and action policies. At the data level, we construct a scalable gesture data generation pipeline by rendering hand models onto real-world scene images. This reduces the sim-to-real visual gap…
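The data-generation step mentioned in the abstract, rendering hand models onto real-world scene images, can be pictured as ordinary alpha compositing. The following is only a minimal sketch under stated assumptions, not the authors' pipeline: it assumes the hand has already been rendered to an RGBA image with a transparency mask, and the function name `composite_hand`, the `top_left` placement parameter, and the array layout are all hypothetical choices made for illustration.

```python
import numpy as np


def composite_hand(scene_rgb, hand_rgba, top_left):
    """Alpha-blend an RGBA hand render onto an RGB scene image.

    Hypothetical helper for illustration only.
    scene_rgb: uint8 array of shape (H, W, 3)
    hand_rgba: uint8 array of shape (h, w, 4), alpha in the last channel
    top_left:  (row, col) of the hand patch's upper-left corner in the scene
    """
    out = scene_rgb.astype(np.float32)  # astype returns a new array
    h, w = hand_rgba.shape[:2]
    r, c = top_left

    # Clip the paste region to the scene bounds.
    r0, c0 = max(r, 0), max(c, 0)
    r1, c1 = min(r + h, out.shape[0]), min(c + w, out.shape[1])
    if r0 >= r1 or c0 >= c1:
        return scene_rgb.copy()  # hand lies entirely outside the scene

    patch = hand_rgba[r0 - r:r1 - r, c0 - c:c1 - c].astype(np.float32)
    alpha = patch[..., 3:4] / 255.0  # shape (h', w', 1), broadcasts over RGB

    # Standard "over" blend: hand where alpha is high, scene elsewhere.
    out[r0:r1, c0:c1] = alpha * patch[..., :3] + (1.0 - alpha) * out[r0:r1, c0:c1]
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


if __name__ == "__main__":
    scene = np.zeros((480, 640, 3), dtype=np.uint8)      # stand-in for a real photo
    hand = np.full((64, 64, 4), 255, dtype=np.uint8)     # stand-in for a rendered hand
    augmented = composite_hand(scene, hand, top_left=(100, 200))
    print(augmented.shape)
```

Because the background stays a real photograph and only the hand pixels are synthetic, a scheme of this kind keeps most of the image in the real-image distribution, which fits the abstract's remark about reducing the sim-to-real visual gap.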
