SpatialTraceGen: High-Fidelity Traces for Efficient VLM Spatial Reasoning Distillation
Gio Huh, Dhruv Sheth, Rayhan Zirvi, Frank Xiao

TL;DR
SpatialTraceGen creates high-quality, step-by-step reasoning datasets for vision-language models, enabling more efficient spatial reasoning training through automated verification and distillation from large models.
Contribution
The paper introduces SpatialTraceGen, a novel framework that distills reasoning traces from large models into high-quality datasets with automated fidelity verification.
Findings
Verifier-guided generation improves the average trace quality score by 17% on CLEVR-Humans
Reduces the variance of trace quality by over 40%
Produces structured multi-hop, multi-tool reasoning traces for fine-tuning smaller models
Abstract
While Vision-Language Models (VLMs) excel in many areas, they struggle with complex spatial reasoning, which requires problem decomposition and strategic tool use. Fine-tuning smaller, more deployable models offers an efficient path to strong performance, but this is hampered by a major bottleneck: the absence of high-quality, step-by-step reasoning data. To address this data-efficiency gap, we introduce SpatialTraceGen, a framework to distill the reasoning processes of a large teacher model into a high-quality dataset of multi-hop, multi-tool reasoning traces. A key innovation is our automated Verifier, which scalably ensures the fidelity of each reasoning step, providing a cost-effective alternative to manual human annotation. On the CLEVR-Humans benchmark, this verifier-guided process improves the average quality score of traces by 17% while reducing quality variance by over 40%.…
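The verifier-guided process described in the abstract can be pictured roughly as the loop below. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Step`, `Trace`, `generate_trace`, `propose_step`, and `verify_step`, along with the score threshold and retry policy, are all hypothetical and do not come from the paper. The teacher model and the Verifier are passed in as plain callables so the sketch stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    tool: str           # name of the tool the teacher chose (hypothetical field)
    rationale: str      # teacher's stated reason for the step
    output: str         # what the tool returned
    score: float = 0.0  # quality score assigned by the verifier


@dataclass
class Trace:
    question: str
    steps: List[Step] = field(default_factory=list)


def generate_trace(
    question: str,
    propose_step: Callable[[Trace], Optional[Step]],  # stand-in for the teacher model
    verify_step: Callable[[Trace, Step], float],      # stand-in for the Verifier
    threshold: float = 0.5,   # assumed acceptance cutoff, not from the paper
    max_steps: int = 8,
    max_retries: int = 2,
) -> Trace:
    """Build a trace step by step, keeping only steps the verifier accepts."""
    trace = Trace(question)
    for _ in range(max_steps):
        accepted = None
        for _ in range(max_retries + 1):
            step = propose_step(trace)
            if step is None:  # teacher signals that the trace is complete
                return trace
            step.score = verify_step(trace, step)
            if step.score >= threshold:
                accepted = step
                break
        if accepted is None:
            # No proposal passed verification; stop instead of keeping a weak step.
            break
        trace.steps.append(accepted)
    return trace
```

In this sketch a rejected step is simply re-proposed up to `max_retries` times. Whether SpatialTraceGen retries, repairs, or discards rejected steps is not stated in the text above, so that policy is an assumption made for illustration.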
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Topic Modeling
