Comparing Visual Reasoning in Humans and AI
Shravan Murlidaran, William Yang Wang, Miguel P. Eckstein

TL;DR
This study compares how humans and AI interpret complex scenes with social interactions, revealing significant differences in scene description agreement and highlighting areas where AI models lag behind human visual reasoning.
Contribution
Introduces a new dataset of complex social scenes and a quantitative metric to compare human and AI scene descriptions, advancing understanding of their interpretative differences.
Findings
AI-human agreement on scene descriptions is much lower than human-human agreement.
Humans utilize different image regions than AI when interpreting scenes.
Results highlight specific areas where AI models fall short of human visual reasoning.
Abstract
Recent advances in natural language processing and computer vision have led to AI models that interpret simple scenes at human levels. Yet we do not have a complete understanding of how humans and AI models differ in their interpretation of more complex scenes. We created a dataset of complex scenes that contained human behaviors and social interactions. AI models and humans each had to describe the scenes with a sentence. We used a quantitative metric of similarity between each AI or human scene description and a ground truth of five other human descriptions of the same scene. Results show that machine/human agreement on scene descriptions is much lower than human/human agreement for our complex scenes. Using an experimental manipulation that occludes different spatial regions of the scenes, we assessed how machines and humans vary in utilizing regions of images to understand the scenes. Together,…
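The comparison described in the abstract, scoring one description against five other human descriptions of the same scene, can be illustrated with a minimal sketch. The abstract does not say which similarity metric the authors used, so the bag-of-words cosine similarity below is an assumed stand-in, and the function names (`tokenize`, `cosine_similarity`, `agreement`, `human_human_agreement`) are hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(a, b):
    """Bag-of-words cosine similarity between two sentences.

    ASSUMPTION: a stand-in metric; the paper's actual metric is not
    given in the abstract.
    """
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = sqrt(sum(v * v for v in ca.values()))
    norm_b = sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def agreement(candidate, references):
    """Mean similarity of one description to a set of reference descriptions."""
    if not references:
        raise ValueError("at least one reference description is required")
    return sum(cosine_similarity(candidate, r) for r in references) / len(references)


def human_human_agreement(descriptions):
    """Leave-one-out baseline: score each human description against the others."""
    scores = [
        agreement(d, descriptions[:i] + descriptions[i + 1:])
        for i, d in enumerate(descriptions)
    ]
    return sum(scores) / len(scores)
```

With six human descriptions per scene, `human_human_agreement` scores each one against the remaining five, while `agreement(model_description, five_human_descriptions)` gives the corresponding machine/human score. Comparing the two numbers mirrors the kind of contrast the abstract reports, though values from this word-overlap metric would not match the paper's.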
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
