VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang

TL;DR
VCode introduces a multimodal coding benchmark using SVGs as symbolic visual representations, revealing current language-centric models' limitations and proposing an agentic framework to improve visual reasoning and fidelity.
Contribution
The paper presents VCode, a new benchmark for multimodal understanding with SVGs, and VCoder, a framework that enhances vision-language models' ability to generate faithful symbolic visual code.
Findings
Frontier VLMs struggle with faithful SVG generation.
VCoder improves model performance by 12.3 points over the top baseline.
Both humans and models perform worse on rendered SVGs than on the original images, indicating challenges in symbolic visual reasoning.
Abstract
Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains - general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation.…
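As a rough illustration of the CodeVQA protocol described in the abstract, the loop below scores a model by having a policy model answer each question from the rendered SVG alone. It is a minimal sketch under stated assumptions, not the authors' implementation: `Example`, `codevqa_accuracy`, and the four callables (`generate_svg`, `render_svg`, `policy_answer`, `is_correct`) are hypothetical names, and the toy stand-ins in the usage lines replace real model and renderer calls.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Example:
    image_path: str  # source image shown to the model under test
    question: str    # question originally asked about that image
    answer: str      # reference answer


def codevqa_accuracy(
    examples: Iterable[Example],
    generate_svg: Callable[[str], str],          # hypothetical: image path -> SVG code
    render_svg: Callable[[str], bytes],          # hypothetical: SVG code -> rendered image
    policy_answer: Callable[[bytes, str], str],  # hypothetical: (rendered image, question) -> answer
    is_correct: Callable[[str, str], bool],      # hypothetical: (prediction, reference) -> match?
) -> float:
    """Fraction of questions the policy model answers correctly from rendered SVGs."""
    total = 0
    correct = 0
    for ex in examples:
        svg = generate_svg(ex.image_path)        # model under test writes SVG for the image
        rendered = render_svg(svg)               # the policy model never sees the original image
        prediction = policy_answer(rendered, ex.question)
        correct += is_correct(prediction, ex.answer)
        total += 1
    return correct / total if total else 0.0


# Toy stand-ins so the sketch runs end to end; real use would call a VLM and an SVG renderer.
demo = [
    Example("cat.png", "How many circles are there?", "2"),
    Example("dog.png", "What shape is shown?", "square"),
]
score = codevqa_accuracy(
    demo,
    generate_svg=lambda path: "<svg xmlns='http://www.w3.org/2000/svg'><circle r='1'/><circle r='2'/></svg>",
    render_svg=lambda svg: svg.encode("utf-8"),
    policy_answer=lambda image, question: str(image.count(b"<circle")),
    is_correct=lambda pred, ref: pred.strip().lower() == ref.strip().lower(),
)
print(f"CodeVQA accuracy: {score:.2f}")
```

Passing the generator, renderer, policy model, and answer matcher as separate callables keeps the scoring loop independent of any particular model. It also makes explicit that the score depends on the chosen policy model as well as on the SVG itself.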
Peer Reviews
Decision: Submitted to ICLR 2026
1. Extending language-centric coding to a new visual-centric coding task is an interesting and novel research direction. 2. This paper converts the multimodal understanding task into a visual-centric coding task and uses a vision-language model (VLM) to evaluate whether the generated code is an adequate and faithful visual representation. 3. The proposed VCoder framework is equipped with two capabilities: thinking with revision and acting with visual tools. Experimental results demonstrate the
1. The dataset in this paper was not processed; it simply used the original images and QA from MM-Vet, MMMU, and CV-Bench. Since the SVG code is entirely generated by the VLM being evaluated, the authors only proposed SVG code generation as a benchmark approach. This benchmark does not design a unified principle for SVG code generation to guide subsequent VLM generation. The lack of a unified principle for SVG code generation can easily lead to instability in the generated code, resulting in uns
1. The idea of using SVG as an intermediate symbolic space for vision-language reasoning is conceptually novel and touches on an underexplored direction in multimodal representation. 2. The work incorporates test-time revision and tool-assisted perception, which reflects awareness of limitations in current models and attempts to address them through modular augmentation rather than purely scaling.
1. The evaluation protocol is fragile: SigLIP similarity offers weak guarantees on fine-grained structure, and CodeVQA depends on the answering model’s biases and failure modes, making correctness a function of the evaluator rather than the representation. This undermines reliability and fairness, which is critical for a benchmark. 2. The dataset is almost entirely repurposed from prior benchmarks without substantial new curation or justification for domain coverage, scale, or annotation qualit
1) The paper introduces a novel paradigm: treating image understanding as code generation (SVG rendering). 2) The VCoder framework combining iterative refinement and external visual tools aligns with recent trends in agentic model enhancement. 3) Experiments are comprehensive, covering both closed- and open-source VLMs with detailed ablations (revision loops, tool usage, modality inputs).
1) The dataset contains only 464 image–question pairs, which is small compared to major multimodal benchmarks. Although the repurposing from MM-Vet/MMMU/CV-Bench ensures diversity, it may limit generalization and statistical reliability of reported differences. 2) CodeVQA uses an external policy model (GPT-4o-mini) as evaluator. This introduces evaluation bias and circularity, especially since some tested models are from the same family. 3) While the paper argues that SVG captures symbolic abstr
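The sample-size concern raised in the reviews can be made concrete with a back-of-the-envelope calculation. The sketch below computes the half-width of a normal-approximation 95% confidence interval for an accuracy measured on 464 items. It assumes independent items and the worst case of 50% accuracy; `accuracy_margin` is a hypothetical helper, not something taken from the paper.

```python
import math


def accuracy_margin(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for accuracy p on n items."""
    return z * math.sqrt(p * (1.0 - p) / n)


margin = accuracy_margin(464)  # 464 image-question pairs, as cited in the review
print(f"95% margin at n=464: about {100 * margin:.1f} points")
```

Under these assumptions the margin is roughly 4.5 points for the full benchmark, and it is wider for any per-domain subset because each subset has fewer items.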
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Software Engineering Research
