Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances
Qirui Wang, Jingyi He, Yining Pan, Xulei Yang, Shijie Li

TL;DR
AFFORDMEM is a training-free framework that improves 3D functional affordance grounding by leveraging cross-scene and in-scene memory, enhancing localization and reference resolution without model fine-tuning.
Contribution
AFFORDMEM introduces a memory-based approach to 3D affordance grounding that requires neither scene annotation nor model fine-tuning, relying instead on reusable memory banks and scene graphs.
Findings
Improves AP50 by 3.23 and 3.7 over prior state-of-the-art on SceneFun3D.
Cross-scene memory enhances fine-grained localization.
In-scene spatial memory helps resolve references to distant or unobserved candidates.
Abstract
Functional affordance grounding requires more than recognizing an object: an agent must localize the specific region that supports an interaction, such as the handle to pull or the button to press. This is difficult for training-free vision-language pipelines because actionable regions are often small, visually ambiguous, and repeated across multiple same-category instances in a scene. We propose AFFORDMEM, a framework that grounds 3D functional affordances by remembering geometry at two levels. The first is cross-scene affordance memory: the agent maintains a category-level memory bank of RGB images with affordance regions rendered as overlays, and recalls the most informative examples at query time to guide a frozen VLM toward small operable subregions that text-only prompting consistently misses. The second is in-scene spatial memory: as the agent processes the scene, it organizes…
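The cross-scene memory described in the abstract, a category-level bank from which the most informative examples are recalled at query time, can be pictured with a small sketch. This is an illustration only, not the authors' implementation: the abstract does not say how "most informative" is scored, so the sketch assumes cosine similarity over feature vectors, and the names `MemoryEntry`, `AffordanceMemoryBank`, `image_id`, and `embedding` are hypothetical.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    image_id: str           # hypothetical handle to an RGB image with an affordance overlay
    embedding: list[float]  # hypothetical feature vector describing that image


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class AffordanceMemoryBank:
    """Category-level store of affordance examples, reusable across scenes."""

    def __init__(self):
        self._by_category = defaultdict(list)

    def add(self, category, entry):
        self._by_category[category].append(entry)

    def recall(self, category, query_embedding, k=2):
        # Assumption: "most informative" is approximated by similarity to the query.
        entries = self._by_category.get(category, [])
        ranked = sorted(
            entries,
            key=lambda e: cosine(e.embedding, query_embedding),
            reverse=True,
        )
        return ranked[:k]
```

In a pipeline of this kind, the recalled examples would presumably be passed to the frozen VLM as visual context alongside the query view; that step is omitted here because the source does not describe its interface.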
