VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

Kevin Qinghong Lin; Yuhao Zheng; Hangyu Ran; Dantong Zhu; Dongxing Mao; Linjie Li; Philip Torr; Alex Jinpeng Wang

arXiv:2511.02778·cs.CV·November 5, 2025

Open Access · 3 Reviews

TL;DR

VCode introduces a multimodal coding benchmark using SVGs as symbolic visual representations, revealing current language-centric models' limitations and proposing an agentic framework to improve visual reasoning and fidelity.

Contribution

The paper presents VCode, a new benchmark for multimodal understanding with SVGs, and VCoder, a framework that enhances vision-language models' ability to generate faithful symbolic visual code.

Findings

01 · Frontier VLMs struggle to generate faithful SVG.

02 · VCoder improves performance by 12.3 points over the top-performing baseline.

03 · Both humans and models perform worse on rendered SVGs than on the original images, indicating that symbolic visual reasoning remains challenging.

Abstract

Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains - general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation.…
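The CodeVQA protocol described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Example` type, the three callables, and the exact-match scoring rule are all hypothetical stand-ins for the model under test, an SVG renderer, and the policy model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    image: str     # placeholder for the source image (path or ID)
    question: str  # question taken from the source benchmark
    answer: str    # reference answer


def codevqa_accuracy(
    examples: Sequence[Example],
    generate_svg: Callable[[str], str],        # model under test: image -> SVG code
    render: Callable[[str], str],              # SVG code -> rendered image
    policy_answer: Callable[[str, str], str],  # policy model: (rendered image, question) -> answer
) -> float:
    """Fraction of questions answered correctly from the rendered SVG alone."""
    if not examples:
        raise ValueError("no examples to evaluate")
    correct = 0
    for ex in examples:
        svg = generate_svg(ex.image)  # the policy model never sees the original image
        rendered = render(svg)
        prediction = policy_answer(rendered, ex.question)
        # Assumption: case-insensitive exact match; the real scoring rule may differ.
        correct += prediction.strip().lower() == ex.answer.strip().lower()
    return correct / len(examples)


# Toy stand-ins so the sketch runs without any model or renderer.
examples = [
    Example("img1", "How many circles?", "2"),
    Example("img2", "What color is the square?", "red"),
]
toy_svgs = {
    "img1": "<svg><circle/><circle/></svg>",
    "img2": "<svg><rect fill='blue'/></svg>",  # wrong color: symbolic meaning lost
}


def toy_policy(rendered: str, question: str) -> str:
    if "circles" in question:
        return str(rendered.count("<circle"))
    return "blue" if "blue" in rendered else "red"


score = codevqa_accuracy(examples, toy_svgs.__getitem__, lambda svg: svg, toy_policy)
print(score)  # 0.5: the first SVG preserves the count, the second loses the color
```

In this toy run the second SVG renders a plausible shape but changes its color, so the policy model answers incorrectly. That is the kind of symbolic infidelity the protocol is meant to surface.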

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. Extending language-centric coding to a new visual-centric coding task is an interesting and novel research direction.
2. This paper converts the multimodal understanding task into a visual-centric coding task and uses a vision-language model (VLM) to evaluate whether the generated code is an adequate and faithful visual representation.
3. The proposed VCoder framework is equipped with two capabilities: thinking with revision and acting with visual tools. Experimental results demonstrate the

Weaknesses

1. The dataset in this paper was not processed; it simply used the original images and QA from MM-Vet, MMMU, and CV-Bench. Since the SVG code is entirely generated by the VLM being evaluated, the authors only proposed SVG code generation as a benchmark approach. This benchmark does not design a unified principle for SVG code generation to guide subsequent VLM generation. The lack of a unified principle for SVG code generation can easily lead to instability in the generated code, resulting in uns

Reviewer 02 · Rating 0 · Confidence 5

Strengths

1. The idea of using SVG as an intermediate symbolic space for vision-language reasoning is conceptually novel and touches on an underexplored direction in multimodal representation.
2. The work incorporates test-time revision and tool-assisted perception, which reflects awareness of limitations in current models and attempts to address them through modular augmentation rather than purely scaling.

Weaknesses

1. The evaluation protocol is fragile: SigLIP similarity offers weak guarantees on fine-grained structure, and CodeVQA depends on the answering model’s biases and failure modes, making correctness a function of the evaluator rather than the representation. This undermines reliability and fairness, which is critical for a benchmark.
2. The dataset is almost entirely repurposed from prior benchmarks without substantial new curation or justification for domain coverage, scale, or annotation qualit

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1) The paper introduces a novel paradigm: treating image understanding as code generation (SVG rendering).
2) The VCoder framework combining iterative refinement and external visual tools aligns with recent trends in agentic model enhancement.
3) Experiments are comprehensive, covering both closed- and open-source VLMs with detailed ablations (revision loops, tool usage, modality inputs).

Weaknesses

1) The dataset contains only 464 image–question pairs, which is small compared to major multimodal benchmarks. Although the repurposing from MM-Vet/MMMU/CV-Bench ensures diversity, it may limit generalization and statistical reliability of reported differences.
2) CodeVQA uses an external policy model (GPT-4o-mini) as evaluator. This introduces evaluation bias and circularity, especially since some tested models are from the same family.
3) While the paper argues that SVG captures symbolic abstr
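To give a rough sense of the sample-size concern in the first weakness, the sketch below computes the half-width of a 95% confidence interval for an accuracy measured on 464 items. It assumes independent items and a normal approximation to the binomial, neither of which is stated in the source, and the function name is hypothetical. It bounds a single accuracy estimate, not a paired difference between two models on the same items.

```python
import math


def accuracy_ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for accuracy p on n items."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return z * math.sqrt(p * (1.0 - p) / n)


# Worst case (p = 0.5) for the 464 image-question pairs cited by the reviewer.
half_width = accuracy_ci_half_width(0.5, 464)
print(f"{100 * half_width:.1f} percentage points")  # about 4.5
```

Under these assumptions, a single reported accuracy carries an uncertainty of up to roughly ±4.5 percentage points, so small gaps between models on this benchmark would be hard to distinguish from noise.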

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Software Engineering Research