CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning
Qi Song, Honglin Li, Yingchen Yu, Haoyi Zhou, Lin Yang, Song Bai, Qi She, Zilong Huang, Yunqing Zhao

TL;DR
CodeDance introduces a flexible, code-based visual reasoning framework that dynamically composes and executes tools, outperforming existing methods and demonstrating emergent behaviors without task-specific fine-tuning.
Contribution
CodeDance proposes executable code as a general solver for visual reasoning, enabling adaptive tool use, emergent behaviors, and stronger performance than baselines and larger models.
Findings
CodeDance outperforms schema-driven and text-only baselines on multiple benchmarks.
Emergent behaviors such as novel tool invocation and cross-task transfer are observed during RL training.
CodeDance surpasses GPT-4o and larger open-source models in visual reasoning tasks.
Abstract
Recent releases such as o3 highlight human-like "thinking with images" reasoning that combines tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we introduce a reward for balanced and adaptive tool calling, which balances exploration with efficiency and mitigates tool overuse. Interestingly, beyond the…
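To make the idea of composing tools in code concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy grayscale image stored as nested lists, and the helpers `crop`, `count_bright`, `draw_box`, and `solve` are hypothetical names invented for illustration. The sketch chains three steps similar to those the abstract describes: call a tool, compute an intermediate result, and render a visual artifact (a box) that can be checked against the answer.

```python
# Hypothetical illustration only; none of these helper names come from the paper.
from typing import List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def crop(image: Image, box: Box) -> Image:
    """Tool 1: return the sub-image inside the box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def count_bright(image: Image, threshold: int = 128) -> int:
    """Tool 2: compute an intermediate result (number of bright pixels)."""
    return sum(1 for row in image for px in row if px >= threshold)


def draw_box(image: Image, box: Box, value: int = 255) -> Image:
    """Tool 3: render a visual artifact by outlining the box on a copy."""
    left, top, right, bottom = box
    out = [row[:] for row in image]
    for x in range(left, right):
        out[top][x] = value
        out[bottom - 1][x] = value
    for y in range(top, bottom):
        out[y][left] = value
        out[y][right - 1] = value
    return out


def solve(image: Image, box: Box) -> Tuple[int, Image]:
    """Compose the tools: crop, measure, then annotate for self-checking."""
    region = crop(image, box)
    bright = count_bright(region)
    annotated = draw_box(image, box)
    return bright, annotated


if __name__ == "__main__":
    img = [[0] * 6 for _ in range(6)]
    img[2][2] = img[2][3] = img[3][2] = 200  # three bright pixels
    bright, annotated = solve(img, (1, 1, 5, 5))
    print("bright pixels in region:", bright)
    for row in annotated:
        print(" ".join(f"{px:3d}" for px in row))
```

The point of the sketch is the composition in `solve`: because each step is ordinary code, the intermediate count and the drawn box are both available for inspection, which is the kind of transparency a fixed coordinate-only schema would not offer.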
