SignVLA: A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation

Xinyu Tan; Ningwei Bai; Harry Gardener; Zhengyang Zhong; Luoyu Zhang; Liuhaichen Yang; Zhekai Duan; Monkgogi Galeitsiwe; Zezhi Tang

arXiv:2602.22514 · cs.RO · February 27, 2026

Open Access

TL;DR

SignVLA is a gloss-free vision-language-action framework for real-time sign language-guided robotic manipulation. It maps visual signs directly to commands, which reduces annotation cost and makes interaction more natural.

Contribution

This work presents what the authors describe as, to their knowledge, the first gloss-free VLA system for sign language-driven robotic control, enabling scalable, natural, real-time human-robot interaction without gloss annotations.

Findings

01. Effective real-time alphabet-level finger-spelling interface.

02. Robust transformation of gesture streams into language commands.

03. Demonstrated precise robotic actions grounded in sign instructions.

Abstract

We present, to our knowledge, the first sign language-driven Vision-Language-Action (VLA) framework for intuitive and inclusive human-robot interaction. Unlike conventional approaches that rely on gloss annotations as intermediate supervision, the proposed system adopts a gloss-free paradigm and directly maps visual sign gestures to semantic instructions. This design reduces annotation cost and avoids the information loss introduced by gloss representations, enabling more natural and scalable multimodal interaction. In this work, we focus on a real-time alphabet-level finger-spelling interface that provides a robust and low-latency communication channel for robotic control. Compared with large-scale continuous sign language recognition, alphabet-level interaction offers improved reliability, interpretability, and deployment feasibility in safety-critical embodied environments. The…
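
To make the idea of an alphabet-level interface concrete, here is a minimal sketch of how a stream of per-frame letter predictions could be collapsed into a text command. It is not the authors' method: the function name `decode_fingerspelling`, the `(label, confidence)` frame format, and the `min_run` and `min_conf` thresholds are all assumptions chosen for illustration. The sketch emits a letter only after the same confident prediction persists for several consecutive frames, which filters out single-frame flicker.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical frame format: (predicted letter or None if no hand, confidence).
Frame = Tuple[Optional[str], float]


def decode_fingerspelling(
    frames: Iterable[Frame],
    min_run: int = 3,       # assumed: frames a letter must persist before it counts
    min_conf: float = 0.8,  # assumed: confidence below this is treated as "no sign"
) -> str:
    """Collapse per-frame letter predictions into a string of spelled letters."""
    out: List[str] = []
    current: Optional[str] = None  # letter currently being held
    run = 0                        # consecutive confident frames for `current`
    emitted = False                # whether `current` was already emitted

    for label, conf in frames:
        # A missing hand or a low-confidence frame resets the run, so the
        # same letter can be spelled twice in a row (e.g. "LL").
        if label is None or conf < min_conf:
            current, run, emitted = None, 0, False
            continue

        if label == current:
            run += 1
        else:
            current, run, emitted = label, 1, False

        # Emit each held letter exactly once, after it has been stable.
        if run >= min_run and not emitted:
            out.append(label)
            emitted = True

    return "".join(out)
```

In a full pipeline, the resulting string would be passed on as the language instruction for the downstream policy; requiring a stable run trades a few frames of latency for fewer spurious letters.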

Taxonomy

Topics: Hand Gesture Recognition Systems · Multimodal Machine Learning Applications · Human Pose and Action Recognition