TRACE: Textual Reasoning for Affordance Coordinate Extraction
Sangyun Park, Jin Kim, Yuchen Cui, Matthew S. Brown

TL;DR
This paper introduces TRACE, a method that enhances vision-language models for robotic affordance prediction by integrating textual reasoning, leading to improved accuracy, interpretability, and robustness in spatial understanding tasks.
Contribution
The paper presents a novel textual reasoning approach and a large-scale dataset that significantly improve VLM performance in affordance extraction for robotics.
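To make the idea of pairing an instruction with an explicit textual rationale concrete, here is a minimal sketch of what one such record and its parser might look like. The `Reasoning: ... Answer: [(x, y), ...]` layout, the normalized coordinate convention, and the names `ReasonedSample` and `parse_response` are all hypothetical; this summary does not specify the actual TRACE output format.

```python
import re
from dataclasses import dataclass

# Hypothetical record: one instruction, the model's textual rationale,
# and the predicted affordance points (assumed normalized to [0, 1]).
@dataclass
class ReasonedSample:
    instruction: str
    rationale: str
    points: list[tuple[float, float]]

# Matches "(x, y)" pairs of non-negative decimals.
_POINT = re.compile(r"\(\s*(\d*\.?\d+)\s*,\s*(\d*\.?\d+)\s*\)")

def parse_response(instruction: str, response: str) -> ReasonedSample:
    """Split an assumed 'Reasoning: ... Answer: [...]' response into its parts."""
    rationale, sep, answer = response.partition("Answer:")
    if not sep:
        raise ValueError("response has no 'Answer:' section")
    rationale = rationale.replace("Reasoning:", "", 1).strip()
    points = [(float(x), float(y)) for x, y in _POINT.findall(answer)]
    return ReasonedSample(instruction, rationale, points)

sample = parse_response(
    "Place the cup on the free space left of the plate",
    "Reasoning: The plate is at the centre; the area to its left is empty. "
    "Answer: [(0.21, 0.55), (0.25, 0.58)]",
)
print(sample.rationale)
print(sample.points)
```

Keeping the rationale as a separate field means it can be inspected or logged on its own, which is one plausible way the reasoning step could support interpretability.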
Findings
Achieves 48.1% accuracy on the Where2Place (W2P) benchmark, a 9.6% relative improvement.
Model performance scales with reasoning data size.
Attention maps show interpretable, dynamic focus during reasoning.
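Because the 9.6% gain is relative rather than absolute, it helps to see what the two reported numbers imply together. The sketch below assumes the gain is measured against a single baseline accuracy; the baseline itself is not stated in this summary, so the values computed here are derived from the reported figures only, and the function name `implied_baseline` is illustrative.

```python
def implied_baseline(new_acc: float, relative_gain: float) -> float:
    """Solve new_acc = baseline * (1 + relative_gain) for the baseline."""
    return new_acc / (1.0 + relative_gain)

new_acc = 48.1         # reported W2P accuracy, in percent
relative_gain = 0.096  # reported 9.6% relative improvement

baseline = implied_baseline(new_acc, relative_gain)
absolute_gain = new_acc - baseline

print(f"implied baseline: {baseline:.1f}%")               # implied baseline: 43.9%
print(f"absolute gain: {absolute_gain:.1f} points")       # absolute gain: 4.2 points
```

Under that assumption, the improvement corresponds to roughly 4.2 percentage points, not 9.6.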
Abstract
Vision-Language Models (VLMs) struggle to translate high-level instructions into the precise spatial affordances required for robotic manipulation. While visual Chain-of-Thought (CoT) methods exist, they are often computationally intensive. In this work, we introduce TRACE (Textual Reasoning for Affordance Coordinate Extraction), a novel methodology that integrates a textual Chain of Reasoning (CoR) into the affordance prediction process. We use this methodology to create the TRACE dataset, a large-scale collection created via an autonomous pipeline that pairs instructions with explicit textual rationales. By fine-tuning a VLM on this data, our model learns to externalize its spatial reasoning before acting. Our experiments show that our TRACE-tuned model achieves state-of-the-art performance, reaching 48.1% accuracy on the primary Where2Place (W2P) benchmark (a 9.6% relative…
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI
