Weakly-Supervised Affordance Grounding Guided by Part-Level Semantic Priors
Peiran Xu, Yadong Mu

TL;DR
This paper presents a novel weakly supervised approach for affordance grounding that leverages foundation models and part-level semantic priors, significantly improving localization accuracy without dense labels.
Contribution
It introduces a new training pipeline using pseudo labels from part segmentation models, along with three key enhancements for better affordance localization.
Findings
Achieved state-of-the-art performance on affordance grounding tasks.
Demonstrated the effectiveness of part-level semantic priors in weak supervision.
Showed significant improvement over existing methods.
Abstract
In this work, we focus on the task of weakly supervised affordance grounding, where a model is trained to identify affordance regions on objects using human-object interaction images and egocentric object images without dense labels. Previous works are mostly built upon class activation maps, which are effective for semantic segmentation but may not be suitable for locating actions and functions. Leveraging recent advanced foundation models, we develop a supervised training pipeline based on pseudo labels. The pseudo labels are generated from an off-the-shelf part segmentation model, guided by a mapping from affordance to part names. Furthermore, we introduce three key enhancements to the baseline model: a label refining stage, a fine-grained feature alignment process, and a lightweight reasoning module. These techniques harness the semantic knowledge of static objects embedded in…
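The pseudo-labeling idea in the abstract (an affordance is mapped to part names, and the part segmentation masks for those names become the training target) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the mapping entries, the `make_pseudo_label` helper, and the `toy_segmenter` stand-in for an off-the-shelf part segmentation model are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List, Tuple

Mask = List[List[int]]  # binary mask: 1 marks a pixel inside the region

# Hypothetical affordance-to-part mapping; the entries are illustrative only.
AFFORDANCE_TO_PARTS: Dict[Tuple[str, str], List[str]] = {
    ("hold", "knife"): ["handle"],
    ("cut_with", "knife"): ["blade"],
    ("drink_with", "cup"): ["rim", "body"],
}


def union_masks(masks: List[Mask], height: int, width: int) -> Mask:
    """Pixel-wise OR of binary masks; all zeros if no masks are given."""
    out = [[0] * width for _ in range(height)]
    for mask in masks:
        for r in range(height):
            for c in range(width):
                if mask[r][c]:
                    out[r][c] = 1
    return out


def make_pseudo_label(
    affordance: str,
    object_name: str,
    image_size: Tuple[int, int],
    segment_part: Callable[[str, str], Mask],
) -> Mask:
    """Build a pseudo label by merging the masks of every mapped part.

    `segment_part(object_name, part_name)` is an assumed interface standing in
    for a real part segmentation model.
    """
    height, width = image_size
    part_names = AFFORDANCE_TO_PARTS.get((affordance, object_name), [])
    masks = [segment_part(object_name, name) for name in part_names]
    return union_masks(masks, height, width)


def toy_segmenter(object_name: str, part_name: str) -> Mask:
    """Toy stand-in returning fixed 2x4 masks instead of running a model."""
    toy = {
        ("knife", "handle"): [[1, 1, 0, 0], [0, 0, 0, 0]],
        ("knife", "blade"): [[0, 0, 1, 1], [0, 0, 0, 0]],
    }
    return toy.get((object_name, part_name), [[0] * 4 for _ in range(2)])


hold_label = make_pseudo_label("hold", "knife", (2, 4), toy_segmenter)
cut_label = make_pseudo_label("cut_with", "knife", (2, 4), toy_segmenter)
```

In this sketch an affordance with no mapping entry yields an empty mask. That makes the scalability concern raised in the reviews concrete: every new affordance type needs a hand-written mapping entry.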
Peer Reviews
Decision: ICLR 2025 Poster
1) The paper is clearly written and easy to follow.
2) The method is well-motivated, and the VFM-assisted pseudo-labeling should effectively address the challenges of the weakly-supervised setting.
3) The overall improvements over existing methods are quite significant.
My biggest concern lies in the experimental section. In Table 2, the reasoning module appears to negatively impact the baseline, and the other two design components provide only marginal improvements.
- The problem is important and well-motivated, as affordance grounding is crucial for robotic manipulation and human-object interaction understanding
- The proposed pseudo-labeling approach effectively leverages existing foundation models (VLpart, SAM) to provide supervision, addressing limitations of previous CAM-based methods
- The label refinement process using exocentric images is novel and well-designed, providing a clever way to improve initial pseudo labels
- The reasoning module helps ge
The choice of CLIP as the vision encoder could be better justified, given previous work suggesting its limitations (vs. DINO, OWL-ViT, SAM). For example, the paper would be stronger with an ablation study of different visual encoders.
- Clear writing and organization.
- Well-motivated technical approach with clear problem formulation.
- This paper proposes a novel approach that uses visual foundation models and part-level semantic priors for WSAG, unleashing the power of these models for affordance learning.
- Using human occlusion cues for label refinement is an innovative insight.
- Comprehensive experimental validation and thoughtful analysis of limitations in existing methods.
- Could benefit from more analysis of failure cases.
- The label refinement stage using human occlusion cues may be problematic when interactions are ambiguous or when multiple affordances exist.
- The mapping from affordance to part names is ad-hoc and manually crafted, which limits the scalability to new affordance types and more complex objects.
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Model Reduction and Neural Networks · Robot Manipulation and Learning
