VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

Lingjie Jiang; Shaohan Huang; Xun Wu; Yixia Li; Dongdong Zhang; Furu Wei

arXiv:2508.09945·cs.CL·August 14, 2025

TL;DR

VisCodex is a unified multimodal framework that merges vision and coding models to significantly enhance code generation from visual and textual inputs, supported by a large-scale dataset and a new benchmark.

Contribution

We introduce VisCodex, a novel model merging approach for multimodal code generation, along with the Multimodal Coding Dataset and InfiBench-V benchmark for evaluation.

Findings

01. Achieves state-of-the-art performance among open-source MLLMs

02. Approaches performance of proprietary models like GPT-4o

03. Demonstrates the effectiveness of model merging and new datasets

Abstract

Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a unified framework that seamlessly merges vision and coding language models to empower MLLMs with strong multimodal code generation abilities. Leveraging a task vector-based model merging technique, we integrate a state-of-the-art coding LLM into a strong vision-language backbone, while preserving both visual comprehension and advanced coding skills. To support training and evaluation, we introduce the Multimodal Coding Dataset (MCD), a large-scale and diverse collection of 598k samples, including high-quality HTML code, chart image-code pairs, image-augmented StackOverflow QA, and algorithmic problems. Furthermore, we propose InfiBench-V, a novel and…
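The task-vector merging mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the vision-language model's LLM backbone and the coding LLM share one base checkpoint, and that the two task vectors are combined with a single weight `lam`, in line with the "single-parameter interpolation" noted in the reviews. The exact weighting form, the function names, and the toy parameter values are assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Per-parameter difference between a fine-tuned model and the shared base."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge_task_vectors(base: Weights, vlm_llm: Weights, coder: Weights,
                       lam: float) -> Weights:
    """Assumed merge rule: base + lam * tau_vlm + (1 - lam) * tau_code."""
    tau_vlm = task_vector(vlm_llm, base)
    tau_code = task_vector(coder, base)
    return {
        name: [
            b + lam * v + (1.0 - lam) * c
            for b, v, c in zip(base[name], tau_vlm[name], tau_code[name])
        ]
        for name in base
    }


if __name__ == "__main__":
    # Hypothetical two-element "layer" for each of the three checkpoints.
    base = {"w": [1.0, 2.0]}
    vlm_llm = {"w": [2.0, 2.0]}   # backbone of the vision-language model
    coder = {"w": [1.0, 4.0]}     # coding LLM fine-tuned from the same base
    print(merge_task_vectors(base, vlm_llm, coder, lam=0.5))
```

With `lam=1.0` the sketch returns the vision-language backbone unchanged, and with `lam=0.0` it returns the coding model, so under this assumed rule the single parameter trades visual comprehension against coding skill.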

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

- Solid and comprehensive contribution: The work offers not only a new training method (model merging) but also new datasets for both training and evaluation, which strengthens its empirical foundation.
- Methodologically clear: The overall paper is well-written, organized, and easy to follow.
- Breadth of evaluation: Covers diverse multimodal coding tasks, showing consistent, though not always dramatic, gains across small and medium-sized models.
- Timely and relevant: Addresses the challeng

Weaknesses

- Unclear advantage over standard fine-tuning: As shown in Table 2, model merging offers almost no improvement for large models (e.g., 33B variant), suggesting diminishing returns at scale. This weakens the claim of broad effectiveness.
- Limited discussion on data-scarce scenarios: One key potential advantage of model merging could be in low-resource multimodal settings, yet this is not explored. It’s unclear whether the approach would still help when task-specific data is limited.
- Lack of

Reviewer 02·Rating 6·Confidence 5

Strengths

1. Introduces a model-merging based path for multimodal code generation, combining vision and coding expertise without full retraining, and expands the problem space with new data and benchmarks.
2. Demonstrates solid empirical rigor, with extensive evaluations showing consistent performance improvements over strong open-source models and competitiveness with proprietary ones.
3. Addresses a meaningful and underexplored capability, turning visual content into functional code and offering pract

Weaknesses

1. The comparison to direct SFT strategies is limited in scope; while the paper includes one- and two-stage baselines, a broader evaluation (e.g., LoRA tuning on both vision and language modules) would strengthen the claim that merging is strictly superior for this setting.
2. The dataset construction pipeline relies heavily on model-generated content (e.g., GPT-4o generated HTML and curated chart code data) but lacks detailed analyses of potential data bias, overfitting to synthetic structures

Reviewer 03·Rating 6·Confidence 4

Strengths

1. Clear, formalized merging recipe with explicit task-vector definitions and a single-parameter interpolation.
2. The paper introduces a compute-efficient design: only the LLM backbone is merged or tuned, while the vision encoder and projector are frozen.
3. The authors introduce a large, diverse dataset (MCD) and a benchmark (InfiBench-V) targeting visually rich programming questions.
4. It reaches several strong numbers on Design2Code/ChartMimic, with the 33B model close to GPT-4o on average.

Weaknesses

1. The paper shows example items scored by "Judge: GPT-4o" with 50/50 component scores, which confirms the setup, but it does not disclose the actual threshold or how it was chosen.
2. The paper does include an unfreezing setup, but only for the replacement baseline. For VisCodex, training freezes the vision encoder and projector and fine-tunes only the language backbone, and the paper does not report an ablation where these modules are unfrozen after merging.
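The freezing setup described in the second weakness can be pictured with a small sketch that splits parameter names into frozen and trainable groups by module prefix. The prefixes `vision_encoder.`, `projector.`, and `language_model.` and the helper `split_trainable` are hypothetical names chosen for illustration; the paper's actual module names and training code are not given here.

```python
from typing import Iterable, List, Sequence, Tuple


def split_trainable(param_names: Iterable[str],
                    frozen_prefixes: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Return (trainable, frozen) parameter names based on module prefixes."""
    prefixes = tuple(frozen_prefixes)
    trainable: List[str] = []
    frozen: List[str] = []
    for name in param_names:
        # A parameter is frozen when it belongs to one of the listed modules.
        (frozen if name.startswith(prefixes) else trainable).append(name)
    return trainable, frozen


if __name__ == "__main__":
    # Hypothetical parameter names for a vision-language model.
    names = [
        "vision_encoder.blocks.0.weight",
        "projector.linear.weight",
        "language_model.layers.0.attn.weight",
        "language_model.lm_head.weight",
    ]
    trainable, frozen = split_trainable(names, ["vision_encoder.", "projector."])
    print("trainable:", trainable)
    print("frozen:", frozen)
```

The ablation the reviewer asks for would correspond, in this sketch, to passing an empty `frozen_prefixes` list after merging so that all three modules are updated. In a PyTorch training script the same split is commonly applied by setting `requires_grad = False` on the frozen parameters.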
