Gestura: A LVLM-Powered System Bridging Motion and Semantics for Real-Time Free-Form Gesture Understanding

Zhuoming Li; Aitong Liu; Mengxi Jia; Yubi Lu; Tengxiang Zhang; Changzhi Sun; Dell Zhang; Xuelong Li

arXiv:2510.21814·cs.CV·November 7, 2025

TL;DR

Gestura is an end-to-end system that pairs a pre-trained large vision-language model with specialized modules for real-time free-form gesture understanding, and is reported to surpass the prior solution, GestureGPT, in accuracy and responsiveness.

Contribution

The paper introduces Gestura, combining LVLMs with a Landmark Processing Module and Chain-of-Thought reasoning for enhanced free-form gesture recognition.

Findings

01 Achieves higher recognition accuracy than existing solutions.

02 Provides a new large-scale dataset with 300,000+ annotated QA pairs.

03 Demonstrates robustness in understanding diverse and ambiguous gestures.

Abstract

Free-form gesture understanding is highly appealing for human-computer interaction, as it liberates users from the constraints of predefined gesture categories. However, the sole existing solution, GestureGPT, suffers from limited recognition accuracy and slow response times. In this paper, we propose Gestura, an end-to-end system for free-form gesture understanding. Gestura harnesses a pre-trained Large Vision-Language Model (LVLM) to align the highly dynamic and diverse patterns of free-form gestures with high-level semantic concepts. To better capture subtle hand movements across different styles, we introduce a Landmark Processing Module that compensates for LVLMs' lack of fine-grained domain knowledge by embedding anatomical hand priors. Further, a Chain-of-Thought (CoT) reasoning strategy enables step-by-step semantic inference, transforming shallow knowledge into deep semantic…
