Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances

Qirui Wang; Jingyi He; Yining Pan; Xulei Yang; Shijie Li

arXiv:2605.11616·cs.CV·May 13, 2026

TL;DR

AFFORDMEM is a training-free framework that improves 3D functional affordance grounding by leveraging cross-scene and in-scene memory, enhancing localization and reference resolution without model fine-tuning.

Contribution

AFFORDMEM introduces a memory-based approach to 3D functional affordance grounding that requires neither scene annotation nor model fine-tuning, relying instead on reusable memory banks and scene graphs.

Findings

01

Improves AP50 by 3.23 and 3.7 points over the prior state of the art on SceneFun3D.

02

Cross-scene memory enhances fine-grained localization.

03

In-scene spatial memory aids in resolving references to distant or unobserved candidates.

Abstract

Functional affordance grounding requires more than recognizing an object: an agent must localize the specific region that supports an interaction, such as the handle to pull or the button to press. This is difficult for training-free vision-language pipelines because actionable regions are often small, visually ambiguous, and repeated across multiple same-category instances in a scene. We propose AFFORDMEM, a framework that grounds 3D functional affordances by remembering geometry at two levels. The first is cross-scene affordance memory: the agent maintains a category-level memory bank of RGB images with affordance regions rendered as overlays, and recalls the most informative examples at query time to guide a frozen VLM toward small operable subregions that text-only prompting consistently misses. The second is in-scene spatial memory: as the agent processes the scene, it organizes…
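
To make the cross-scene memory idea concrete, the following is a minimal sketch of a category-level memory bank with top-k recall. It is not the authors' implementation: the names `MemoryEntry`, `AffordanceMemoryBank`, and `recall`, the use of plain feature vectors, and the choice of cosine similarity as the measure of "most informative" are all assumptions made for illustration, since the abstract does not specify the selection criterion.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class MemoryEntry:
    # Hypothetical fields: an identifier for a stored RGB image whose
    # affordance region is rendered as an overlay, plus a feature vector.
    image_id: str
    embedding: List[float]


@dataclass
class AffordanceMemoryBank:
    # Entries are grouped by object category, so memory is reusable across scenes.
    entries: Dict[str, List[MemoryEntry]] = field(default_factory=dict)

    def add(self, category: str, entry: MemoryEntry) -> None:
        self.entries.setdefault(category, []).append(entry)

    def recall(self, category: str, query: List[float], k: int = 2) -> List[MemoryEntry]:
        """Return up to k entries of the category, most similar to the query first.

        Assumption: similarity to the query stands in for "most informative".
        """
        candidates = self.entries.get(category, [])
        ranked = sorted(candidates, key=lambda e: cosine(query, e.embedding), reverse=True)
        return ranked[:k]


# Usage with made-up two-dimensional embeddings.
bank = AffordanceMemoryBank()
bank.add("cabinet", MemoryEntry("cab_handle_01", [1.0, 0.0]))
bank.add("cabinet", MemoryEntry("cab_handle_02", [0.6, 0.8]))
bank.add("cabinet", MemoryEntry("cab_knob_03", [0.0, 1.0]))
bank.add("light_switch", MemoryEntry("switch_01", [1.0, 0.0]))

recalled = bank.recall("cabinet", [0.9, 0.1], k=2)
recalled_ids = [entry.image_id for entry in recalled]
```

In a pipeline of the kind the abstract describes, the recalled overlay images would presumably be passed to the frozen VLM as visual examples alongside the query; how that prompt is assembled is not stated in the text above, so it is left out of the sketch.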
