HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models
MD Khalequzzaman Chowdhury Sayem, Mubarrat Tajoar Chowdhury, Yihalem Yimolal Tiruneh, Muneeb A. Khan, Muhammad Salman Ali, Binod Bhattarai, Seungryul Baek

TL;DR
HandVQA is a large-scale benchmark designed to evaluate and improve vision-language models' understanding of detailed hand anatomy and spatial reasoning, revealing current limitations and enabling zero-shot transfer to downstream tasks.
Contribution
The paper introduces HandVQA, a diagnostic benchmark of over 1.6 million multiple-choice questions built from 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), to assess and improve VLMs' fine-grained spatial reasoning about hands.
Findings
Current VLMs hallucinate finger parts and make errors in geometric reasoning about hand poses.
Fine-tuning with HandVQA improves model accuracy on hand-related tasks.
Zero-shot spatial knowledge transfer enhances downstream hand gesture recognition.
Abstract
Understanding the fine-grained articulation of human hands is critical in high-stakes settings such as robot-assisted surgery, chip manufacturing, and AR/VR-based human-AI interaction. Despite achieving near-human performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning, especially in interpreting complex and articulated hand poses. We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs' understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions. We evaluate several state-of-the-art VLMs (LLaVA, DeepSeek and Qwen-VL) in both…
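To make the idea of a controlled multiple-choice question about joint geometry concrete, here is a minimal sketch of how one angle question could be derived from 3D joint annotations. It is an illustration under stated assumptions, not the authors' pipeline: the 21-keypoint ordering, the bin thresholds, the question wording, and every function name below are hypothetical.

```python
import numpy as np

# Hypothetical 21-keypoint layout (wrist = 0, then four joints per finger).
# The joint ordering actually used by HandVQA is not stated in the text above.
INDEX_MCP, INDEX_PIP, INDEX_DIP = 5, 6, 7

OPTIONS = ["straight", "slightly bent", "strongly bent"]


def joint_angle_deg(a, b, c):
    """Return the angle at joint b, in degrees, between segments b->a and b->c."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against values just outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def bend_label(angle_deg):
    """Map an angle to a coarse category. Thresholds are assumed, not from the paper."""
    if angle_deg >= 150:
        return "straight"
    if angle_deg >= 100:
        return "slightly bent"
    return "strongly bent"


def make_angle_question(joints):
    """Build one multiple-choice item from a (21, 3) array of 3D joint positions."""
    angle = joint_angle_deg(joints[INDEX_MCP], joints[INDEX_PIP], joints[INDEX_DIP])
    return {
        "question": "How bent is the index finger at its middle joint?",
        "options": OPTIONS,
        "answer_index": OPTIONS.index(bend_label(angle)),
    }
```

Because the answer is computed from the 3D annotation rather than written by hand, the same recipe could in principle be repeated over many joints and images; distance and relative-position questions would follow the same pattern with a different geometric measurement.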
