A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning
Zixin Zhang, Kanghao Chen, Hanqing Wang, Hongfei Zhang, Harold Haodong Chen, Chenfei Liao, Litao Guo, Ying-Cong Chen

TL;DR
A4-Agent introduces a zero-shot, three-stage framework for affordance prediction that leverages pre-trained models without fine-tuning, significantly improving generalization and performance over existing methods.
Contribution
It proposes a novel, training-free, three-stage agentic framework that decouples affordance reasoning into visualization, decision, and localization using foundation models.
Findings
Outperforms state-of-the-art supervised methods on multiple benchmarks.
Demonstrates robust generalization to real-world environments.
Operates effectively without task-specific training or fine-tuning.
Abstract
Affordance prediction, which identifies interaction regions on objects based on language instructions, is critical for embodied AI. Prevailing end-to-end models couple high-level reasoning and low-level grounding into a single monolithic pipeline and rely on training over annotated datasets, which leads to poor generalization on novel objects and unseen environments. In this paper, we move beyond this paradigm by proposing A4-Agent, a training-free agentic framework that decouples affordance prediction into a three-stage pipeline. Our framework coordinates specialized foundation models at test time: (1) a visualization stage that employs generative models to visualize how an interaction would look; (2) a decision stage that utilizes large vision-language models to decide which object part to interact with; and (3) a localization stage that orchestrates vision…
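The three-stage decoupling described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`predict_affordance`, `visualize`, `decide`, `localize`, `AffordanceResult`) is a hypothetical placeholder, the stage interfaces are assumptions, and the toy stand-in functions replace the real foundation models so the example runs without any weights. It only shows the general idea of chaining independent, frozen stages at test time.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names below are hypothetical placeholders, not the authors' API.
Image = List[List[int]]            # stand-in for an image array
Box = Tuple[int, int, int, int]    # (x_min, y_min, x_max, y_max)


@dataclass
class AffordanceResult:
    imagined: Image    # output of the visualization stage
    part: str          # object part chosen by the decision stage
    region: Box        # region returned by the localization stage


def predict_affordance(
    image: Image,
    instruction: str,
    visualize: Callable[[Image, str], Image],
    decide: Callable[[Image, Image, str], str],
    localize: Callable[[Image, str], Box],
) -> AffordanceResult:
    """Chain three independent stages; nothing is trained or fine-tuned here."""
    # Stage 1: a generative model imagines the interaction (assumed interface).
    imagined = visualize(image, instruction)
    # Stage 2: a vision-language model picks the object part (assumed interface).
    part = decide(image, imagined, instruction)
    # Stage 3: vision foundation models locate that part (assumed interface).
    region = localize(image, part)
    return AffordanceResult(imagined, part, region)


# Toy stand-ins so the sketch runs without any model weights.
def fake_visualize(image: Image, instruction: str) -> Image:
    return image


def fake_decide(image: Image, imagined: Image, instruction: str) -> str:
    return "handle" if "mug" in instruction else "body"


def fake_localize(image: Image, part: str) -> Box:
    if part == "handle":
        return (0, 0, 1, 1)
    return (0, 0, len(image[0]), len(image))


result = predict_affordance(
    [[0, 0], [0, 0]], "pick up the mug",
    fake_visualize, fake_decide, fake_localize,
)
print(result.part, result.region)
```

Because each stage is passed in as a plain callable, any one of them could be swapped for a different pre-trained model without touching the other two, which is the property the decoupled design is meant to provide.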
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics
