VinciCoder: Unifying Multimodal Code Generation via Coarse-to-fine Visual Reinforcement Learning

Xuanle Zhao; Deyang Jiang; Zhixiong Zeng; Lei Chen; Haibo Qiu; Jing Huang; Yufeng Zhong; Liming Zheng; Yilin Cao; Lin Ma

arXiv:2511.00391·cs.CV·December 1, 2025

TL;DR

VinciCoder is a unified multimodal code generation model trained in two stages, supervised fine-tuning followed by visual reinforcement learning, to improve performance across diverse benchmarks.

Contribution

The paper introduces VinciCoder, a novel model that unifies multimodal code generation with a coarse-to-fine visual reinforcement learning strategy, enhancing generalization and visual fidelity.

Findings

01. Achieves state-of-the-art results on multiple benchmarks.

02. The coarse-to-fine ViRL strategy significantly improves visual fidelity.

03. Large-scale dataset of 1.6M image-code pairs supports training.

Abstract

Multimodal code generation has garnered significant interest within the research community. Despite the notable success of recent vision-language models (VLMs) on specialized tasks like chart-to-code generation, their reliance on single-task training regimens fosters a narrow paradigm that hinders the development of generalized VIsioN Code Intelligence. In this work, we introduce VinciCoder, a unified multimodal code generation model that addresses this limitation via a two-stage training framework. We begin by constructing a large-scale Supervised Finetuning (SFT) corpus comprising 1.6M image-code pairs for tasks involving direct code generation and visual-based code refinement. Subsequently, we introduce a Visual Reinforcement Learning (ViRL) strategy, which employs a coarse-to-fine reward mechanism to improve visual fidelity by calculating…
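The abstract is cut off before it states what the reward calculates, so the following is only an illustrative sketch of what a coarse-to-fine visual reward could look like, not the paper's method. It assumes the image rendered from generated code is compared with the target image at a coarse level (average-pooled views) and at a fine level (non-overlapping patches). The names `similarity`, `downsample`, `patches`, `coarse_to_fine_reward`, the pixel-difference similarity, and the 50/50 weighting are all hypothetical choices made for this example.

```python
from typing import Iterator, List

# Hypothetical representation: grayscale image as rows of floats in [0, 1].
Image = List[List[float]]


def similarity(a: Image, b: Image) -> float:
    """Stand-in similarity (an assumption): 1 - mean absolute pixel difference."""
    total, count = 0.0, 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return 1.0 - total / count


def downsample(img: Image, factor: int) -> Image:
    """Average-pool the image by `factor` to obtain a coarse view."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(0, h - h % factor, factor):
        row = []
        for c in range(0, w - w % factor, factor):
            block = [img[r + i][c + j] for i in range(factor) for j in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def patches(img: Image, size: int) -> Iterator[Image]:
    """Yield non-overlapping size x size patches in row-major order."""
    h, w = len(img), len(img[0])
    for r in range(0, h - size + 1, size):
        for c in range(0, w - size + 1, size):
            yield [row[c:c + size] for row in img[r:r + size]]


def coarse_to_fine_reward(rendered: Image, target: Image,
                          patch_size: int = 2, coarse_weight: float = 0.5) -> float:
    """Blend a coarse (pooled) score with a fine (per-patch mean) score."""
    coarse = similarity(downsample(rendered, patch_size), downsample(target, patch_size))
    local = [similarity(p, q)
             for p, q in zip(patches(rendered, patch_size), patches(target, patch_size))]
    fine = sum(local) / len(local)
    return coarse_weight * coarse + (1.0 - coarse_weight) * fine


# A checkerboard and its inverse look identical after pooling but differ in every pixel.
checker = [[float((r + c) % 2) for c in range(4)] for r in range(4)]
inverse = [[1.0 - v for v in row] for row in checker]
print(coarse_to_fine_reward(checker, checker))  # identical images
print(coarse_to_fine_reward(checker, inverse))  # coarse view agrees, fine view does not
```

In this toy setup the pooled comparison alone would rate the inverted checkerboard as a perfect match, which is the kind of gap a fine-grained term is meant to close. A real reward would likely use a learned or perceptual similarity rather than raw pixel differences.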

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning