TRACE: Textual Reasoning for Affordance Coordinate Extraction

Sangyun Park; Jin Kim; Yuchen Cui; Matthew S. Brown

arXiv:2511.01999·cs.RO·November 5, 2025

TL;DR

This paper introduces TRACE, a method that enhances vision-language models for robotic affordance prediction by integrating textual reasoning, leading to improved accuracy, interpretability, and robustness in spatial understanding tasks.

Contribution

The paper presents a novel textual reasoning approach and a large-scale dataset that significantly improve VLM performance in affordance extraction for robotics.

Findings

01. Achieves 48.1% accuracy on the Where2Place (W2P) benchmark, a 9.6% relative improvement.

02. Model performance scales with the amount of reasoning data.

03. Attention maps show interpretable, dynamic focus during reasoning.

Abstract

Vision-Language Models (VLMs) struggle to translate high-level instructions into the precise spatial affordances required for robotic manipulation. While visual Chain-of-Thought (CoT) methods exist, they are often computationally intensive. In this work, we introduce TRACE (Textual Reasoning for Affordance Coordinate Extraction), a novel methodology that integrates a textual Chain of Reasoning (CoR) into the affordance prediction process. We use this methodology to create the TRACE dataset, a large-scale collection created via an autonomous pipeline that pairs instructions with explicit textual rationales. By fine-tuning a VLM on this data, our model learns to externalize its spatial reasoning before acting. Our experiments show that our TRACE-tuned model achieves state-of-the-art performance, reaching 48.1% accuracy on the primary Where2Place (W2P) benchmark (a 9.6% relative…
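
The headline numbers are easy to misread, so a small arithmetic sketch may help. It assumes the "9.6%" is a relative gain over a single baseline accuracy; the baseline itself is not quoted in this excerpt, so the value computed here is back-derived for illustration and is not a figure from the paper. The function name `implied_baseline` is hypothetical.

```python
def implied_baseline(new_accuracy: float, relative_gain: float) -> float:
    """Baseline accuracy implied by a new accuracy and a relative gain over it.

    Solves new_accuracy = baseline * (1 + relative_gain) for baseline.
    """
    return new_accuracy / (1.0 + relative_gain)


trace_acc = 48.1       # percent, as stated in the abstract
relative_gain = 0.096  # 9.6%, ASSUMED to be relative to the baseline accuracy

# Back-derived, not quoted from the paper.
baseline = implied_baseline(trace_acc, relative_gain)
absolute_gain = trace_acc - baseline

print(f"implied baseline: {baseline:.1f}%")
print(f"absolute gain: {absolute_gain:.1f} percentage points")
```

Under that assumption, a 9.6% relative improvement corresponds to roughly 4 percentage points of absolute accuracy, which is why the two readings of "9.6% improvement" should not be conflated.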

Code & Models

Datasets

jink-ucla/TRACE
dataset · 6 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI