HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics

Masatoshi Tateno; Gido Kato; Hirokatsu Kataoka; Yoichi Sato; Takuma Yagi

arXiv:2512.00885 · cs.CV · December 2, 2025

TL;DR

HanDyVQA is a new fine-grained video question-answering benchmark that captures detailed hand-object interaction dynamics, challenging current models and highlighting areas for improvement in spatial, motion, and part-level understanding.

Contribution

Introduces HanDyVQA, a benchmark with six complementary question types and segmentation masks, together with an analysis of model performance on fine-grained HOI reasoning.

Findings

01. Current models reach only 73% accuracy, far below human performance.

02. Explicit HOI cues improve model accuracy.

03. Challenges remain in spatial, motion, and part-level reasoning.

Abstract

Hand-object interaction (HOI) inherently involves dynamics where human manipulations produce distinct spatio-temporal effects on objects. However, existing semantic HOI benchmarks have focused either on manipulation or on the resulting effects at a coarse level, lacking the fine-grained spatio-temporal reasoning needed to capture the underlying dynamics in HOI. We introduce HanDyVQA, a fine-grained video question-answering benchmark that comprehensively covers both the manipulation and effect aspects of HOI. HanDyVQA comprises six complementary question types (Action, Process, Objects, Location, State Change, and Object Parts), totalling 11.1K multiple-choice QA pairs. The collected QA pairs require recognizing manipulation styles, hand/object motions, and part-level state changes. HanDyVQA also includes 10.3K segmentation masks for Objects and Object Parts questions, enabling the evaluation of object/part-level…
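
As a rough illustration of how a multiple-choice benchmark with several question types is commonly scored, the sketch below computes per-type accuracy and an unweighted average over types. The record layout (`type`, `answer`, `prediction`), the function names, and the toy data are assumptions made for this example and are not taken from the HanDyVQA release; whether the reported accuracy is averaged over types or over all questions is likewise not stated here.

```python
from collections import defaultdict

# The six question types named in the abstract.
QUESTION_TYPES = (
    "Action", "Process", "Objects", "Location", "State Change", "Object Parts",
)


def per_type_accuracy(records):
    """Return accuracy per question type.

    `records` is an iterable of dicts with hypothetical keys:
    'type' (one of QUESTION_TYPES), 'answer' (index of the correct choice),
    and 'prediction' (index chosen by the model).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["type"]] += 1
        correct[r["type"]] += int(r["prediction"] == r["answer"])
    # Only report types that actually appear in the records.
    return {t: correct[t] / total[t] for t in QUESTION_TYPES if total[t] > 0}


def macro_average(acc_by_type):
    """Unweighted mean of the per-type accuracies (an assumed aggregation)."""
    return sum(acc_by_type.values()) / len(acc_by_type)


# Toy records, invented purely for illustration.
toy = [
    {"type": "Action", "answer": 0, "prediction": 0},
    {"type": "Action", "answer": 2, "prediction": 1},
    {"type": "Location", "answer": 3, "prediction": 3},
    {"type": "Object Parts", "answer": 1, "prediction": 1},
]

acc = per_type_accuracy(toy)
print(acc)
print(macro_average(acc))
```

Reporting accuracy per type alongside the average makes it easier to see which kinds of reasoning (for example spatial or part-level questions) pull the overall score down.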

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Robot Manipulation and Learning