Predicate Hierarchies Improve Few-Shot State Classification
Emily Jin, Joy Hsu, Jiajun Wu

TL;DR
This paper introduces PHIER, a method that uses predicate hierarchies and structured latent spaces to improve few-shot state classification in robotic environments, enabling better generalization with limited data.
Contribution
PHIER is the first approach to leverage predicate hierarchies and hyperbolic embeddings for few-shot state classification in robotics.
Findings
PHIER outperforms existing methods in few-shot and out-of-distribution scenarios.
PHIER demonstrates strong zero- and few-shot generalization from simulation to real-world tasks.
Predicate hierarchies significantly improve state classification accuracy with limited data.
Abstract
State classification of objects and their relations is core to many long-horizon tasks, particularly in robot planning and manipulation. However, the combinatorial explosion of possible object-predicate combinations, coupled with the need to adapt to novel real-world environments, makes it a desideratum for state classification models to generalize to novel queries with few examples. To this end, we propose PHIER, which leverages predicate hierarchies to generalize effectively in few-shot scenarios. PHIER uses an object-centric scene encoder, self-supervised losses that infer semantic relations between predicates, and a hyperbolic distance metric that captures hierarchical structure; it learns a structured latent space of image-predicate pairs that guides reasoning over state classification queries. We evaluate PHIER in the CALVIN and BEHAVIOR robotic environments and show that PHIER…
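The abstract names a hyperbolic distance metric but does not say which model of hyperbolic space is used. As a minimal sketch, assuming the Poincaré ball model (an assumption made here for illustration, not something the text above confirms), the distance between two embeddings can be computed as follows. The function name `poincare_distance` is hypothetical and is not taken from the paper.

```python
import math


def poincare_distance(u, v):
    """Distance between two points inside the unit Poincare ball.

    Assumption: the Poincare ball model with curvature -1. The paper's
    exact formulation may differ.
    """
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# The same Euclidean gap (0.1) costs more near the boundary than near the origin.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])
```

Because distances grow rapidly toward the boundary of the ball, general concepts can sit near the origin and more specific ones near the edge, which is the usual intuition for why hyperbolic embeddings suit tree-like hierarchies.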
Peer Reviews
Decision: ICLR 2025 Poster
1. Incorporates predicate hierarchies effectively for few-shot state classification.
2. Demonstrates strong generalization from simulator images to real-world images.
3. Utilizes hyperbolic space to encode complex hierarchical relationships efficiently.
1. Why does the proposed method perform worse than baseline methods on in-distribution samples? More detailed analysis and insights are needed.
2. Why are there no results for pre-trained models on ID-OOD samples or real-world samples?
3. In the ablation study, components are added sequentially; however, the order of addition matters. Further analysis with ablations that remove each component should be conducted.
4. Important details are missing, such as the process for constructing positive and
- All deep learning components and encoding strategies are based on strong intuitions and deep understanding of the different latent spaces and tools used.
- Empirical results show significant improvement upon current state of the art.
- Tested on environments with varying levels of complexity and realism.
- Well cited with a comprehensive literature review in the related work section
- The method is clearly presented bit by bit, slowly building up the method components in a readable fashion.
-
- Theoretical justification of the method is relatively weak. It would be groundbreaking to show a strongly linked connection between hyperbolic space and hierarchical structure. This point is still unclear, even though the authors provide some intuition for it.
- Introduction lacks citations, making it difficult to understand which parts are novel and which belong to previous work (before diving into the rest of the paper).
- No evidence is provided for claims that large vision models strug
- The paper is very well written, and the method is clear and easy to follow.
- Most design choices are justified and effective, as shown with ablations.
- Good experimental setup and results in simulation. The results regarding novel predicates show that the learned latent space is efficient, especially compared to baselines.
- The ablation section needs more details. In its current state, it is not clear what type of architecture the ablation baseline models utilize. What is the architecture of the supervised model? Just an image encoder and the predicate encoder followed by the supervised loss? This needs clarification. What is the hyperbolic linear layer replaced with? Simple MLPs?
- The network architecture is unclear from the current state of the method section. What are the trainable parameters? Where exactly
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Machine Learning and Data Classification
