Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation Models

Qing Zhang; Xuesong Li; Jing Zhang

arXiv:2602.20501 · cs.CV · March 17, 2026

TL;DR

This paper investigates how vision foundation models understand affordance by analyzing their geometric and interaction perception capabilities, and demonstrates that combining these cues enhances affordance reasoning without additional training.

Contribution

The paper shows that geometric and interaction cues are key to affordance understanding in VFMs, and that simply fusing them improves affordance estimation in a zero-shot manner.

Findings

01 DINO encodes part-level geometric structures
02 Flux contains verb-conditioned spatial attention maps
03 Fusion of geometric and interaction cues improves affordance estimation

Abstract

What does it mean for a visual system to truly understand affordance? We argue that this understanding hinges on two complementary capacities: geometric perception, which identifies the structural parts of objects that enable interaction, and interaction perception, which models how an agent's actions engage with those parts. To test this hypothesis, we conduct a systematic probing of Visual Foundation Models (VFMs). We find that models like DINO inherently encode part-level geometric structures, while generative models like Flux contain rich, verb-conditioned spatial attention maps that serve as implicit interaction priors. Crucially, we demonstrate that these two dimensions are not merely correlated but are composable elements of affordance. By simply fusing DINO's geometric prototypes with Flux's interaction maps in a training-free and zero-shot manner, we achieve affordance…
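To make the idea of a training-free fusion concrete, here is a minimal sketch in Python. It is not the paper's method: the function `fuse_cues`, its `top_k` parameter, the prototype construction (mean of the most-attended patch features), and the multiplicative fusion rule are all illustrative assumptions, and the inputs are plain NumPy arrays standing in for DINO patch features and a Flux verb-conditioned attention map.

```python
import numpy as np


def fuse_cues(patch_feats, interaction_map, top_k=5):
    """Hypothetical fusion of a geometric cue with an interaction cue.

    patch_feats:     (H, W, D) array standing in for per-patch features
                     (e.g. from a DINO-style encoder). Assumed input.
    interaction_map: (H, W) non-negative array standing in for a
                     verb-conditioned attention map. Assumed input.
    Returns an (H, W) heatmap scaled to [0, 1].
    """
    h, w, d = patch_feats.shape
    feats = patch_feats.reshape(-1, d).astype(np.float64)
    # L2-normalise so that dot products are cosine similarities.
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    attn = interaction_map.reshape(-1).astype(np.float64)

    # Assumed prototype: mean feature of the top_k most-attended patches.
    top_idx = np.argsort(attn)[-top_k:]
    proto = feats[top_idx].mean(axis=0)
    proto = proto / (np.linalg.norm(proto) + 1e-8)

    # Geometric cue: similarity of every patch to the prototype,
    # with negative similarities clipped to zero.
    geo = np.clip(feats @ proto, 0.0, None)

    # Assumed fusion rule: elementwise product, then min-max scaling.
    fused = geo * attn
    fused = fused - fused.min()
    peak = fused.max()
    if peak > 0:
        fused = fused / peak
    return fused.reshape(h, w)
```

The product rule is only one possible choice: it keeps a patch only when it both resembles the attended part and receives interaction attention. A weighted sum would instead let either cue contribute on its own.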

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Autonomous Vehicle Technology and Safety