Transductive Visual Programming: Evolving Tool Libraries from Experience for Spatial Reasoning
Shengguang Wu, Xiaohan Wang, Yuhui Zhang, Hao Zhu, Serena Yeung-Levy

TL;DR
Transductive Visual Programming (TVP) introduces a self-evolving framework that learns and refines tools from experience, significantly improving spatial reasoning performance in 3D scene understanding tasks.
Contribution
TVP is the first framework to build new tools from experience rather than speculation, enabling continuous improvement in visual programming for spatial reasoning.
Findings
Outperforms GPT-4o by 22% on Omni3D-Bench.
Tools learned from experience are used 5x more frequently than inductively created tools.
Generalizes well to unseen spatial tasks without test-set-specific tuning.
Abstract
Spatial reasoning in 3D scenes requires precise geometric calculations that challenge vision-language models. Visual programming addresses this by decomposing problems into steps calling specialized tools, yet existing methods rely on either fixed toolsets or speculative tool induction before solving problems, resulting in suboptimal programs and poor utilization of induced tools. We present Transductive Visual Programming (TVP), a novel framework that builds new tools from its own experience rather than speculation. TVP first solves problems using basic tools while accumulating experiential solutions into an Example Library, then abstracts recurring patterns from these programs into reusable higher-level tools for an evolving Tool Library. This allows TVP to tackle new problems with increasingly powerful tools learned from experience. On Omni3D-Bench, TVP achieves state-of-the-art…
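The loop described in the abstract (solve with existing tools, store good solutions in an Example Library, then abstract recurring patterns into a Tool Library) can be illustrated with a minimal sketch. This is not the authors' implementation: the class and function names, the step names, and the use of simple n-gram counting in place of the paper's abstraction step are all assumptions made for illustration. The admission threshold of 8.5 is the value quoted in the peer reviews; how the judge computes the score is not specified here, so the score is passed in directly.

```python
from collections import Counter
from dataclasses import dataclass, field

# Threshold quoted in the reviews (tau_q = 8.5); the scoring rubric is unspecified.
QUALITY_THRESHOLD = 8.5


@dataclass
class Libraries:
    # Hypothetical containers standing in for the Example Library and Tool Library.
    examples: list = field(default_factory=list)  # (question, tuple of program steps)
    tools: dict = field(default_factory=dict)     # tool name -> tuple of steps it wraps


def admit(libs, question, steps, judge_score):
    """Store a solved program only if its judge score exceeds the threshold."""
    if judge_score > QUALITY_THRESHOLD:
        libs.examples.append((question, tuple(steps)))
        return True
    return False


def abstract_tools(libs, window=2, min_count=2):
    """Promote step sequences that recur across admitted programs to named tools.

    Assumption: contiguous n-gram counting is a stand-in for the paper's
    abstraction procedure, which is not detailed in this summary.
    """
    counts = Counter()
    for _, steps in libs.examples:
        # Count each pattern at most once per program.
        patterns = {steps[i:i + window] for i in range(len(steps) - window + 1)}
        counts.update(patterns)
    for pattern, n in counts.items():
        if n >= min_count:
            libs.tools["_then_".join(pattern)] = pattern
    return libs.tools
```

For example, admitting the programs `["detect", "depth", "compare"]` and `["detect", "depth", "ratio"]` with scores above the threshold, then calling `abstract_tools`, yields a single tool wrapping the shared `detect`, `depth` prefix. The point of the design is that a tool is only created after the pattern has already appeared in accepted solutions.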
Peer Reviews
Decision: ICLR 2026 Poster
Innovative and Well-Structured Framework
+ The paper introduces Transductive Visual Programming (TVP), a novel and conceptually original framework that enables a model to iteratively learn reusable tools from its own problem-solving experience. Its dual-library closed-loop design (Example–Tool Library) is systematic and complete, effectively realizing a self-improving learning cycle.
Strong presentation quality
+ The paper is clearly written, logically organized, and well-illustrated with info
Lack of Transparency in Evaluation Mechanisms
The evaluation procedures governing both the Example Library and the Tool Library are under-specified, which raises concerns about reproducibility and interpretability. Specifically:
+ Unclear criteria for Example Library admission. Although the paper states that a VLM judge scores each generated program and admits examples whose quality exceeds a threshold of τq = 8.5, it does not define the concrete scoring dimensions, such as logical correctness,
- The core idea of "transductive abstraction" from a library of successful solutions is elegant and well-motivated. It ensures that created tools are practically useful and grounded in experience, which is a clear advantage over VADAR's more speculative, question-based induction (as clearly shown in Fig 2).
- The zero-shot generalization results on the SpatialScore-Hard collection (Table 2, Fig 5) are a key strength. Showing that tools learned only on Omni3D-Bench are effective on completely di
- The TVP framework itself is extremely complex and computationally expensive. For each query, it makes multiple LLM/VLM calls (retrieve, generate m programs, execute m programs, judge m programs). It then has a heavy, periodic maintenance loop that involves more LLM calls for clustering, abstraction, validation, and merging. This "meta-cost" of running the TVP framework is not discussed but seems prohibitively high, likely many times more expensive than just running a baseline model.
- As ment
- **Conceptual originality:** The paper introduces transductive tool evolution, which learns abstractions from experience rather than induction before use. This represents a genuine conceptual advance in visual programming and aligns well with human-like skill acquisition.
- **Technical soundness:** The dual-library architecture and full algorithmic specification (program generation, clustering, abstraction, validation, and merging) are rigorous and clearly grounded. The validation mechanism e
- **Limited scope of evaluation:** While visual programming was originally designed for 2D visual reasoning and perception tasks, this paper evaluates TVP only on 3D spatial reasoning. It remains unclear whether the proposed transductive abstraction also benefits conventional 2D visual reasoning benchmarks (e.g., MME, MMMU).
- **Heavy dependence on large proprietary models:** TVP’s components rely heavily on GPT-4o and its mini variants. It remains unclear how performance scales with smaller o
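The cost concern raised in the reviews above (per-query retrieval, generation, execution, and judging of m programs, plus a periodic maintenance loop) can be made concrete with a rough call-count model. This is a back-of-the-envelope sketch only: every parameter name and every number used with it is a hypothetical placeholder, not a figure reported in the paper or the reviews.

```python
def calls_per_query(m, retrieval_calls=1, exec_calls_per_program=0):
    """Model calls for one query under the pipeline the reviewer describes.

    Each of the m candidate programs costs one generation call and one judge
    call, plus however many model calls its execution triggers (assumed 0 by
    default, since that depends on which tools the program invokes).
    """
    return retrieval_calls + m * (2 + exec_calls_per_program)


def total_calls(num_queries, m, maintenance_every, calls_per_maintenance):
    """Total model calls over a run, including periodic library maintenance.

    Assumption: maintenance (clustering, abstraction, validation, merging)
    runs once every `maintenance_every` queries at a fixed cost per round.
    """
    maintenance_rounds = num_queries // maintenance_every
    return num_queries * calls_per_query(m) + maintenance_rounds * calls_per_maintenance
```

With purely illustrative values (100 queries, m = 3, maintenance every 20 queries at 10 calls per round), `total_calls(100, 3, 20, 10)` gives 750 calls, against 100 for a baseline that makes one call per query. The real ratio depends on values this summary does not state.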
Taxonomy
Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Artificial Intelligence in Games
