GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

Junho Kim; Xu Cao; Houze Yang; Bikram Boote; Ana Jojic; Fiona Ryan; Bolin Lai; Sangmin Lee; James M. Rehg

arXiv:2605.15764 · cs.CV · May 18, 2026

TL;DR

GRASP introduces a large-scale dataset and an accompanying benchmark, GRASP-Bench, for social reasoning in multi-person videos, grounded in non-verbal cues such as gaze and deictic gestures.

Contribution

The paper presents GRASP, a dataset that links high-level social QA with fine-grained gaze and deictic gesture events, and proposes Social Grounding Reward (SGR), a learning signal intended to improve social reasoning in models.

Findings

01. SGR improves model performance on GRASP-Bench.

02. The GRASP dataset contains 290K QA pairs over 46K videos totaling 749 hours.

03. Models trained with SGR maintain zero-shot performance on related benchmarks.

Abstract

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage…
