Learning Physical Graph Representations from Visual Scenes
Daniel M. Bear, Chaofei Fan, Damian Mrowca, Yunzhu Li, Seth Alter, Aran Nayebi, Jeremy Schwartz, Li Fei-Fei, Jiajun Wu, Joshua B. Tenenbaum, Daniel L.K. Yamins

TL;DR
This paper introduces Physical Scene Graphs (PSGs), a scene representation that explicitly encodes objects, parts, and their physical properties, and PSGNet, a neural network architecture that learns to extract PSGs, targeting the structured scene understanding that standard CNNs lack.
Contribution
The paper proposes PSGs as hierarchical graph representations of scenes and PSGNet to learn these structures, integrating feedback, graph pooling, and perceptual grouping for enhanced scene segmentation.
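As a rough illustration of what a hierarchical graph representation of a scene could look like as a data structure, the sketch below stores nodes at several levels, a latent attribute vector per node, parent-to-child links across levels, and edges between nodes. It is not PSGNet's implementation: the class names, the field names, the convention that level 0 is the finest scale, and the restriction of edges to nodes at the same level are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class PSGNode:
    # Hypothetical node record; field names are illustrative only.
    node_id: int
    level: int                 # assumption: 0 is the finest scale
    attributes: List[float]    # latent attribute vector bound to the node
    children: List[int] = field(default_factory=list)  # ids one level below


@dataclass
class PhysicalSceneGraph:
    # Hypothetical container for a hierarchy of nodes plus same-level edges.
    nodes: Dict[int, PSGNode] = field(default_factory=dict)
    edges: Set[Tuple[int, int]] = field(default_factory=set)

    def add_node(self, node_id, level, attributes, children=()):
        # A parent may only group nodes from the level directly below it.
        for child in children:
            if self.nodes[child].level != level - 1:
                raise ValueError("children must sit one level below the parent")
        self.nodes[node_id] = PSGNode(
            node_id, level, list(attributes), list(children)
        )

    def connect(self, a, b):
        # Undirected edge; this sketch assumes edges join same-level nodes.
        if self.nodes[a].level != self.nodes[b].level:
            raise ValueError("edges join nodes at the same level")
        self.edges.add((min(a, b), max(a, b)))

    def leaves_of(self, node_id):
        # Collect the childless descendants of a node, depth first.
        node = self.nodes[node_id]
        if not node.children:
            return [node_id]
        leaves = []
        for child in node.children:
            leaves.extend(self.leaves_of(child))
        return leaves
```

In this sketch, a scene would be built bottom-up: fine-scale nodes are added first, then grouped under coarser parents, and leaves_of maps any node back to the fine-scale nodes it covers, which is the kind of lookup a segmentation readout would need.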
Findings
PSGNet outperforms existing self-supervised methods on scene segmentation.
PSGNet generalizes well to unseen objects and arrangements.
Learned latent attributes capture intuitive scene properties.
Abstract
Convolutional Neural Networks (CNNs) have proved exceptional at learning representations for visual object categorization. However, CNNs do not explicitly encode objects, parts, and their physical properties, which has limited CNNs' success on tasks that require structured understanding of visual scenes. To overcome these limitations, we introduce the idea of Physical Scene Graphs (PSGs), which represent scenes as hierarchical graphs, with nodes in the hierarchy corresponding intuitively to object parts at different scales, and edges to physical connections between parts. Bound to each node is a vector of latent attributes that intuitively represent object properties such as surface shape and texture. We also describe PSGNet, a network architecture that learns to extract PSGs by reconstructing scenes through a PSG-structured bottleneck. PSGNet augments standard CNNs by including:…
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization
