VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models
Lingjie Jiang, Shaohan Huang, Xun Wu, Yixia Li, Dongdong Zhang, Furu Wei

TL;DR
VisCodex is a unified multimodal framework that merges vision and coding models to significantly enhance code generation from visual and textual inputs, supported by a large-scale dataset and a new benchmark.
Contribution
We introduce VisCodex, a novel model merging approach for multimodal code generation, along with the Multimodal Coding Dataset and InfiBench-V benchmark for evaluation.
Findings
Achieves state-of-the-art performance among open-source MLLMs
Approaches performance of proprietary models like GPT-4o
Demonstrates the effectiveness of model merging and new datasets
Abstract
Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a unified framework that seamlessly merges vision and coding language models to empower MLLMs with strong multimodal code generation abilities. Leveraging a task vector-based model merging technique, we integrate a state-of-the-art coding LLM into a strong vision-language backbone, while preserving both visual comprehension and advanced coding skills. To support training and evaluation, we introduce the Multimodal Coding Dataset (MCD), a large-scale and diverse collection of 598k samples, including high-quality HTML code, chart image-code pairs, image-augmented StackOverflow QA, and algorithmic problems. Furthermore, we propose InfiBench-V, a novel and…
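The abstract names task vector-based merging but this summary does not give the formula. As a rough illustration only, the sketch below uses one common form of task arithmetic and assumes that the vision-language model's language backbone and the coding LLM were both fine-tuned from the same base LLM. The names `task_vector`, `merge_backbones`, and `lam` are hypothetical and are not taken from the paper. Weights are plain Python lists so the example stays self-contained.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def task_vector(finetuned: Params, base: Params) -> Params:
    """Element-wise difference between fine-tuned and base weights."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge_backbones(base: Params, vlm_lm: Params, coder: Params, lam: float) -> Params:
    """Interpolate two task vectors with a single weight `lam` (assumed form).

    merged = base + lam * (vlm_lm - base) + (1 - lam) * (coder - base)
    Only language-backbone weights are passed in; any vision encoder or
    projector weights are assumed to be kept as they are in the VLM.
    """
    tau_vlm = task_vector(vlm_lm, base)
    tau_code = task_vector(coder, base)
    return {
        name: [
            b + lam * tv + (1.0 - lam) * tc
            for b, tv, tc in zip(base[name], tau_vlm[name], tau_code[name])
        ]
        for name in base
    }


# Toy weights, purely illustrative.
base = {"w": [1.0, 2.0]}
vlm_lm = {"w": [2.0, 2.0]}
coder = {"w": [1.0, 4.0]}
merged = merge_backbones(base, vlm_lm, coder, lam=0.5)
```

Under this assumed form, `lam=1.0` recovers the vision-language backbone unchanged and `lam=0.0` recovers the coding LLM, so the single parameter trades off the two sets of skills.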
Peer Reviews
Decision: ICLR 2026 Poster
- Solid and comprehensive contribution: The work offers not only a new training method (model merging) but also new datasets for both training and evaluation, which strengthens its empirical foundation.
- Methodologically clear: the overall paper is well-written, organized, and easy to follow.
- Breadth of evaluation: Covers diverse multimodal coding tasks, showing consistent, though not always dramatic, gains across small and medium-sized models.
- Timely and relevant: Addresses the challeng
- Unclear advantage over standard fine-tuning: As shown in Table 2, model merging offers almost no improvement for large models (e.g., 33B variant), suggesting diminishing returns at scale. This weakens the claim of broad effectiveness.
- Limited discussion on data-scarce scenarios: One key potential advantage of model merging could be in low-resource multimodal settings, yet this is not explored. It’s unclear whether the approach would still help when task-specific data is limited.
- Lack of
1. Introduces a model-merging based path for multimodal code generation, combining vision and coding expertise without full retraining, and expands the problem space with new data and benchmarks.
2. Demonstrates solid empirical rigor, with extensive evaluations showing consistent performance improvements over strong open-source models and competitiveness with proprietary ones.
3. Addresses a meaningful and underexplored capability, turning visual content into functional code and offering pract
1. The comparison to direct SFT strategies is limited in scope; while the paper includes one- and two-stage baselines, a broader evaluation (e.g., LoRA tuning on both vision and language modules) would strengthen the claim that merging is strictly superior for this setting.
2. The dataset construction pipeline relies heavily on model-generated content (e.g., GPT-4o generated HTML and curated chart code data) but lacks detailed analyses of potential data bias, overfitting to synthetic structures
1. Clear, formalized merging recipe with explicit task-vector definitions and a single-parameter interpolation.
2. The paper introduces a compute-efficient design: only the LLM backbone is merged or tuned, while the vision encoder and projector are frozen.
3. The authors introduce a large, diverse dataset (MCD) and a benchmark (InfiBench-V) targeting visually rich programming questions.
4. It reaches several strong numbers on Design2Code and ChartMimic, with the 33B model close to GPT-4o on average.
1. The paper shows example items scored by "Judge: GPT-4o" with 50/50 component scores, which confirms the setup, but it does not disclose the actual threshold or how it was chosen.
2. The paper does include an unfreezing setup, but only for the replacement baseline. For VisCodex, training freezes the vision encoder and projector and fine-tunes only the language backbone, and the paper does not report an ablation where these modules are unfrozen after merging.
