TL;DR
This paper introduces the 3D-PC benchmark to compare humans and deep neural networks (DNNs) on visual perspective taking. It finds that DNNs perform well at basic 3D scene analysis but struggle with human-like perspective reasoning.
Contribution
The paper presents a novel benchmark for 3D perception and perspective taking, highlighting the gap between DNNs and humans in complex visual reasoning tasks.
Findings
DNNs perform well on object depth order tasks, approaching human accuracy.
DNNs struggle with perspective taking tasks that require reasoning beyond basic 3D analysis.
Fine-tuning improves DNN performance on the simpler tasks but not on VPT-Strategy, the perspective taking task designed to limit shortcut visual strategies.
Abstract
Visual perspective taking (VPT) is the ability to perceive and reason about the perspectives of others. It is an essential feature of human intelligence, which develops over the first decade of life and requires an ability to process the 3D structure of visual scenes. A growing number of reports have indicated that deep neural networks (DNNs) become capable of analyzing 3D scenes after training on large image datasets. We investigated whether this emergent ability for 3D analysis in DNNs is sufficient for VPT with the 3D perception challenge (3D-PC): a novel benchmark for 3D perception in humans and DNNs. The 3D-PC comprises three 3D-analysis tasks posed within natural scene images: (1) a simple test of object depth order, (2) a basic VPT task (VPT-basic), and (3) another version of VPT (VPT-Strategy) designed to limit the effectiveness of "shortcut" visual strategies. We tested human…
