Simple 3D Pose Features Support Human and Machine Social Scene Understanding
Wenshuo Qin, Leyla Isik

TL;DR
Simple 3D pose features extracted from short video clips predict human social judgments better than most vision deep neural networks (DNNs), and adding them improves DNN alignment with those judgments.
Contribution
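The paper introduces a pose and depth estimation pipeline that automatically extracts 3D body joint positions from video. It shows that a compact set of 3D features, describing only the position and direction of each person, accounts for the predictive performance of the full joint set, which itself predicts social judgments better than most DNNs.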
Findings
3D body joints predict social judgments better than most DNNs.
Minimal 3D features explain the prediction performance of full body joint sets.
3D pose features improve DNN alignment with human social judgments.
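To make the idea of "minimal 3D features" concrete, here is a minimal sketch of reducing one person's 3D joints to a position and a facing direction. It is not the authors' pipeline. The function name `compact_pose_features`, the use of the joint centroid as position, the shoulder-based direction estimate, and the y-up axis convention are all assumptions made for illustration.

```python
import math

def compact_pose_features(joints, left_shoulder, right_shoulder, up=(0.0, 1.0, 0.0)):
    """Reduce one person's 3D joints to a 6-value (position, direction) tuple.

    joints: list of (x, y, z) tuples for a single person in a single frame.
    left_shoulder, right_shoulder: indices into `joints` (hypothetical layout).
    up: world up axis; y-up is an assumption, not taken from the paper.
    """
    n = len(joints)
    # Position: centroid of all joints (an illustrative choice).
    position = tuple(sum(j[k] for j in joints) / n for k in range(3))

    # Shoulder vector pointing from the left shoulder to the right shoulder.
    lx, ly, lz = joints[left_shoulder]
    rx, ry, rz = joints[right_shoulder]
    s = (rx - lx, ry - ly, rz - lz)

    # Facing direction: cross(up, shoulder vector), i.e. the horizontal
    # vector perpendicular to the shoulder line in a right-handed frame.
    f = (
        up[1] * s[2] - up[2] * s[1],
        up[2] * s[0] - up[0] * s[2],
        up[0] * s[1] - up[1] * s[0],
    )
    norm = math.sqrt(sum(c * c for c in f))
    # Fall back to a zero vector if the shoulders are degenerate.
    direction = tuple(c / norm for c in f) if norm > 0 else (0.0, 0.0, 0.0)
    return position + direction
```

For a two-person clip, the features of both people could be concatenated per frame, which gives a far smaller descriptor than the full joint set.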
Abstract
Humans effortlessly recognize social interactions from visual input, yet the underlying computations remain unknown, and social interaction recognition challenges even the most advanced deep neural networks (DNNs). Here, we hypothesized that humans rely on 3D visuospatial pose information to make social judgments, and that this information is largely absent from most vision DNNs. To test these hypotheses, we used a novel pose and depth estimation pipeline to automatically extract 3D body joint positions from short video clips. We compared the ability of these body joints to predict human social judgments in the videos with embeddings from over 350 vision DNNs. We found that body joints predicted social judgments better than most DNNs. We then reduced the 3D body joints to an even more compact feature set describing only the 3D position and direction of people in the videos. We found…
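The comparison described in the abstract can be pictured as fitting a linear map from each feature set (3D joints or a DNN embedding) to human ratings and scoring it on held-out videos. The sketch below is one way to do that, assuming ridge regression, interleaved cross-validation folds, and Pearson correlation as the score. The abstract does not state the regression model or metric, so these choices and all function names are assumptions.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - tail) / M[i][i]
    return w

def ridge_fit(X, y, alpha=1.0):
    """Fit ridge regression on centered data; returns (weights, intercept)."""
    n, d = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]
    yc = [v - y_mean for v in y]
    # Normal equations: (Xc^T Xc + alpha * I) w = Xc^T yc
    gram = [[sum(r[i] * r[j] for r in Xc) + (alpha if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(d)]
    w = solve(gram, xty)
    intercept = y_mean - sum(wi * mi for wi, mi in zip(w, x_mean))
    return w, intercept

def cross_validated_score(X, y, n_folds=5, alpha=1.0):
    """Correlation between held-out predictions and ratings across folds."""
    preds = [0.0] * len(y)
    for fold in range(n_folds):
        test_idx = set(range(fold, len(y), n_folds))  # interleaved folds
        train = [i for i in range(len(y)) if i not in test_idx]
        w, b = ridge_fit([X[i] for i in train], [y[i] for i in train], alpha)
        for i in test_idx:
            preds[i] = b + sum(wi * xi for wi, xi in zip(w, X[i]))
    return pearson(preds, y)
```

Running `cross_validated_score` once per feature set, with the same folds and ratings, yields scores that can be ranked against each other, which is the kind of comparison the abstract reports between body joints and DNN embeddings.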
Taxonomy
Topics: Human Pose and Action Recognition · Action Observation and Synchronization · Emotion and Mood Recognition
