Compositional Law Parsing with Latent Random Functions
Fan Shi, Bin Li, Xiangyang Xue

TL;DR
This paper introduces CLAP, a deep latent variable model that learns and manipulates the underlying laws of concepts in visual scenes, demonstrating human-like compositional understanding and interpretability.
Contribution
The paper presents an encoding-decoding architecture in which concept-specific latent random functions, instantiated with Neural Processes, perform compositional law parsing.
Findings
Outperforms baseline methods in physics and reasoning tasks
Enables law manipulation and composition for interpretability
Learns laws of position and appearance from visual scenes
Abstract
Human cognition is compositional. We understand a scene by decomposing it into different concepts (e.g., the shape and position of an object) and learning the respective laws of these concepts, which may be either natural (e.g., laws of motion) or man-made (e.g., laws of a game). A model's ability to parse these laws automatically indicates how well it understands the scene, which gives law parsing a central role in many visual tasks. This paper proposes a deep latent variable model for Compositional LAw Parsing (CLAP), which achieves human-like compositionality through an encoding-decoding architecture that represents concepts in the scene as latent variables. CLAP employs concept-specific latent random functions, instantiated with Neural Processes, to capture the laws of concepts. Our experimental results demonstrate that CLAP outperforms the baseline methods in multiple…
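To make the idea of "concept-specific latent random functions" more concrete, here is a minimal sketch of a Neural Process-style function per concept. It is an illustration under stated assumptions, not the authors' implementation: the class name `ConceptFunction`, its method names, the layer sizes, the concept names, and the use of random untrained weights are all hypothetical, and the image encoder and decoder that CLAP places around the latent concepts are omitted. The sketch only shows the data flow: context pairs for one concept are summarized into a global latent "law" variable, which then conditions predictions at new inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_params(sizes):
    # Random, untrained weights: this sketch shows data flow only.
    return [(rng.normal(0, 0.5, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    # Plain feed-forward pass with tanh on every layer except the last.
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

class ConceptFunction:
    """Neural Process-style latent random function for one concept (hypothetical)."""

    def __init__(self, x_dim=1, z_dim=2, g_dim=4, hidden=16):
        self.g_dim = g_dim
        self.encoder = mlp_params([x_dim + z_dim, hidden, 2 * g_dim])
        self.decoder = mlp_params([x_dim + g_dim, hidden, z_dim])

    def infer_law(self, x_ctx, z_ctx):
        # Encode each (x, z) context pair, then mean-pool so the summary
        # does not depend on the order of the context points.
        pairs = np.concatenate([x_ctx, z_ctx], axis=-1)
        stats = mlp(self.encoder, pairs).mean(axis=0)
        mu, log_sigma = stats[:self.g_dim], stats[self.g_dim:]
        return mu, np.exp(log_sigma)

    def sample_law(self, mu, sigma):
        # Reparameterized draw of the global latent variable for this concept.
        return mu + sigma * rng.standard_normal(self.g_dim)

    def predict(self, x_tgt, g):
        # Condition every target input on the same global latent g.
        g_tiled = np.broadcast_to(g, (x_tgt.shape[0], self.g_dim))
        return mlp(self.decoder, np.concatenate([x_tgt, g_tiled], axis=-1))

# One independent function per concept (concept names are assumptions).
concepts = {name: ConceptFunction() for name in ("position", "appearance")}
x_ctx = np.array([[0.0], [1.0], [2.0]])   # e.g. indices of observed frames
x_tgt = np.array([[3.0]])                 # index of the frame to predict
z_ctx = {name: rng.normal(size=(3, 2)) for name in concepts}

laws = {}
for name, f in concepts.items():
    mu, sigma = f.infer_law(x_ctx, z_ctx[name])
    laws[name] = f.sample_law(mu, sigma)

z_pred = {name: concepts[name].predict(x_tgt, laws[name]) for name in concepts}
```

Because each concept keeps its own latent variable in `laws`, replacing one entry with a latent inferred from a different sample would change only that concept's prediction, which is one way to picture the law manipulation and composition listed under Findings.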
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications
