A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning

Zixin Zhang; Kanghao Chen; Hanqing Wang; Hongfei Zhang; Harold Haodong Chen; Chenfei Liao; Litao Guo; Ying-Cong Chen

arXiv:2512.14442 · cs.CV · December 17, 2025

TL;DR

A4-Agent introduces a zero-shot, three-stage framework for affordance prediction that leverages pre-trained models without fine-tuning, significantly improving generalization and performance over existing methods.

Contribution

The paper proposes a training-free, three-stage agentic framework that uses foundation models to decouple affordance reasoning into visualization, decision, and localization.

Findings

1. Outperforms state-of-the-art supervised methods on multiple benchmarks.

2. Demonstrates robust generalization to real-world environments.

3. Operates effectively without task-specific training or fine-tuning.

Abstract

Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. Prevailing end-to-end models couple high-level reasoning and low-level grounding into a single monolithic pipeline and rely on training over annotated datasets, which leads to poor generalization on novel objects and unseen environments. In this paper, we move beyond this paradigm by proposing A4-Agent, a training-free agentic framework that decouples affordance prediction into a three-stage pipeline. Our framework coordinates specialized foundation models at test time: (1) a Dreamer that employs generative models to visualize how an interaction would look; (2) a Thinker that utilizes large vision-language models to decide what object part to interact with; and (3) a Spotter that orchestrates vision…
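The three-stage decoupling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `predict_affordance`, the `AffordanceResult` container, the bounding-box region format, and the choice to pass the imagined image to the Thinker are all assumptions made for illustration, and the three stages are replaced by toy stand-ins rather than real foundation models.

```python
from dataclasses import dataclass
from typing import Any, Tuple

# Hypothetical region format (x0, y0, x1, y1); the paper's actual output
# representation is not specified in this listing.
Box = Tuple[int, int, int, int]


@dataclass
class AffordanceResult:
    imagined: Any  # output of the visualization stage (Dreamer)
    part: str      # object part chosen by the decision stage (Thinker)
    region: Box    # region returned by the localization stage (Spotter)


def predict_affordance(image, instruction, dreamer, thinker, spotter) -> AffordanceResult:
    """Chain three independent, swappable stages; nothing is trained here."""
    imagined = dreamer(image, instruction)        # visualize how the interaction would look
    part = thinker(image, imagined, instruction)  # decide what part to interact with (assumed inputs)
    region = spotter(image, part)                 # localize that part in the original image
    return AffordanceResult(imagined, part, region)


# Toy stand-ins so the sketch runs without any models (all hypothetical).
def toy_dreamer(image, instruction):
    return {"source": image, "prompt": instruction}


def toy_thinker(image, imagined, instruction):
    return "handle" if "pick" in instruction.lower() else "body"


def toy_spotter(image, part):
    return (10, 20, 30, 40) if part == "handle" else (0, 0, 100, 100)


result = predict_affordance(
    "mug.png", "Pick up the mug", toy_dreamer, toy_thinker, toy_spotter
)
print(result.part, result.region)
```

Because each stage is passed in as a plain callable, any one of them can be replaced by a different pre-trained model without touching the other two, which is the practical point of decoupling reasoning from grounding.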

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics