Thinking with Programming Vision: Towards a Unified View for Thinking with Images
Zirun Guo, Minjie Hong, Feng Zhang, Kai Jia, Tao Jin

TL;DR
This paper introduces CodeVision, a scalable code-as-tool framework in which multimodal large language models (MLLMs) generate code as a universal interface for invoking image operations, improving their robustness and flexibility on visual reasoning tasks.
Contribution
It proposes a novel code-as-tool framework with a two-stage training process and new datasets, enhancing robustness and multi-tool reasoning in multimodal models.
Findings
Improved robustness to image orientation and corruption.
Enhanced multi-tool composition and execution efficiency.
Better error recovery from runtime feedback.
Abstract
Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions, underscoring the need for more robust tool-based reasoning. To address this, we propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn…
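The code-as-tool idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper `run_generated_code`, the convention that generated code stores its output in a variable named `result`, and the nested-list stand-in for an image are all assumptions made for illustration. The sketch shows a model-written snippet being executed against an image, with a runtime error returned as feedback so that a later attempt can recover.

```python
import traceback


def run_generated_code(code: str, image):
    """Execute a model-generated snippet with `image` in scope (hypothetical helper).

    Returns (ok, payload): the value bound to `result` on success,
    or the final traceback line as textual feedback on failure.
    """
    namespace = {"image": image}
    try:
        # NOTE: plain exec is unsafe; a real system would need a sandbox.
        exec(code, namespace)
    except Exception:
        return False, traceback.format_exc().strip().splitlines()[-1]
    if "result" not in namespace:
        return False, "generated code did not set `result`"
    return True, namespace["result"]


# A toy 2x3 "image" as nested lists (assumption: stands in for pixel data).
image = [[1, 2, 3],
         [4, 5, 6]]

# Two hand-written snippets standing in for successive model turns:
# the first calls an undefined function, the second rotates 90 degrees clockwise.
attempts = [
    "result = rotate(image)",
    "result = [list(row) for row in zip(*image[::-1])]",
]

feedback_log = []
final = None
for code in attempts:
    ok, payload = run_generated_code(code, image)
    if ok:
        final = payload
        break
    feedback_log.append(payload)  # would be shown to the model on the next turn

print(feedback_log)
print(final)
```

Because the interface is arbitrary code rather than a fixed registry, any operation expressible in the execution environment is available without registering a new tool; the trade-off is that execution has to be isolated and errors have to be fed back to the model.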
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
