GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions
Junho Kim, Xu Cao, Houze Yang, Bikram Boote, Ana Jojic, Fiona Ryan, Bolin Lai, Sangmin Lee, James M. Rehg

TL;DR
GRASP introduces a large-scale dataset and evaluation framework for social reasoning in multi-person videos, emphasizing non-verbal cues like gaze and gestures.
Contribution
The paper presents GRASP, a dataset that links high-level social QA to fine-grained non-verbal cues (gaze and deictic gestures), and proposes Social Grounding Reward (SGR), a learning signal intended to improve social reasoning in models.
Findings
SGR improves model performance on GRASP-Bench.
GRASP dataset contains 290K QA pairs over 46K videos.
Models trained with SGR maintain zero-shot performance on related benchmarks.
Abstract
Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage…
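As a rough sense of scale, the dataset figures quoted in the abstract (290K QA pairs, 46K videos, 749 hours) imply some per-video averages. The sketch below only does that arithmetic; the constants are the rounded numbers from the abstract, the function name `dataset_averages` is illustrative, and the results are estimates rather than statistics reported by the authors.

```python
# Rounded figures as quoted in the abstract; exact counts may differ.
NUM_QA_PAIRS = 290_000
NUM_VIDEOS = 46_000
TOTAL_HOURS = 749


def dataset_averages(num_qa: int, num_videos: int, total_hours: float) -> tuple[float, float]:
    """Return (average QA pairs per video, average video length in seconds)."""
    qa_per_video = num_qa / num_videos
    seconds_per_video = total_hours * 3600 / num_videos
    return qa_per_video, seconds_per_video


qa_per_video, seconds_per_video = dataset_averages(NUM_QA_PAIRS, NUM_VIDEOS, TOTAL_HOURS)
print(f"~{qa_per_video:.1f} QA pairs per video")
print(f"~{seconds_per_video:.0f} seconds per video on average")
```

Under these rounded inputs, the averages come out to roughly six QA pairs per video and clips a little under a minute long.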
