HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models

MD Khalequzzaman Chowdhury Sayem; Mubarrat Tajoar Chowdhury; Yihalem Yimolal Tiruneh; Muneeb A. Khan; Muhammad Salman Ali; Binod Bhattarai; Seungryul Baek

arXiv:2603.26362·cs.CV·March 30, 2026

TL;DR

HandVQA is a large-scale benchmark designed to evaluate and improve vision-language models' understanding of detailed hand anatomy and spatial reasoning, revealing current limitations and enabling zero-shot transfer to downstream tasks.

Contribution

The paper introduces HandVQA, a diagnostic benchmark with over 1.6 million questions based on 3D hand datasets, to assess and enhance VLMs' fine-grained hand spatial reasoning.

Findings

01. Current VLMs exhibit hallucinated finger parts and incorrect geometric reasoning.

02. Fine-tuning with HandVQA improves model accuracy on hand-related tasks.

03. Zero-shot spatial knowledge transfer enhances downstream hand gesture recognition.

Abstract

Understanding the fine-grained articulation of human hands is critical in high-stakes settings such as robot-assisted surgery, chip manufacturing, and AR/VR-based human-AI interaction. Despite achieving near-human performance on general vision-language benchmarks, current vision-language models (VLMs) struggle with fine-grained spatial reasoning, especially in interpreting complex and articulated hand poses. We introduce HandVQA, a large-scale diagnostic benchmark designed to evaluate VLMs' understanding of detailed hand anatomy through visual question answering. Built upon high-quality 3D hand datasets (FreiHAND, InterHand2.6M, FPHA), our benchmark includes over 1.6M controlled multiple-choice questions that probe spatial relationships between hand joints, such as angles, distances, and relative positions. We evaluate several state-of-the-art VLMs (LLaVA, DeepSeek and Qwen-VL) in both…
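To make the idea of a controlled multiple-choice question about joint geometry concrete, here is a minimal sketch of how a question could be derived from 3D joint coordinates. It is not the authors' pipeline: the function names, the angle thresholds (150 and 90 degrees), the answer labels, and the question template are all assumptions chosen for illustration.

```python
import math

# Hypothetical answer labels; the benchmark's actual answer options may differ.
ANGLE_OPTIONS = ["straight", "slightly bent", "strongly bent"]


def joint_distance(a, b):
    """Euclidean distance between two 3D joints given as (x, y, z) tuples."""
    return math.dist(a, b)


def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by the segments b->a and b->c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


def bin_angle(angle_deg):
    """Map a continuous angle to a discrete label (thresholds are assumed)."""
    if angle_deg >= 150:
        return ANGLE_OPTIONS[0]
    if angle_deg >= 90:
        return ANGLE_OPTIONS[1]
    return ANGLE_OPTIONS[2]


def make_angle_question(finger, joint_name, a, b, c):
    """Build one multiple-choice item from three consecutive joints."""
    return {
        "question": f"How bent is the {finger} finger at the {joint_name} joint?",
        "options": list(ANGLE_OPTIONS),
        "answer": bin_angle(joint_angle(a, b, c)),
    }


# Three collinear joints give a 180 degree angle, so the answer is "straight".
item = make_angle_question("index", "middle", (0, 0, 0), (1, 0, 0), (2, 0, 0))
print(item["answer"])
```

Deriving the answer from 3D annotations rather than from human labels is what keeps such a question controlled: the correct option follows deterministically from the joint coordinates and the chosen thresholds.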

Code & Models

Datasets

kcsayem/handvqa (dataset, 488 downloads)
