TL;DR
Vision2Code is a multi-domain benchmark (2,169 test examples from 15 source datasets) for evaluating image-to-code generation without reference code, using dataset-specific rubrics and an evaluation aligned with human judgment.
Contribution
It provides a reference-code-free benchmark spanning 15 source datasets, an evaluation framework that scores rendered output against the source image, and insights into domain-dependent model performance and training improvements.
Findings
Models perform well on charts and graphs but poorly on spatial scenes and diagrams.
The benchmark's evaluation aligns better with human judgment than previous methods.
Training with filtered outputs improves model performance on the benchmark.
Abstract
Image-to-code generation tests whether a vision-language model (VLM) can recover the structure of an image well enough to express it as executable code. Existing benchmarks either focus on narrow visual domains, depend on paired executable reference code, or rely on generic rubrics that miss domain-specific reconstruction errors. We introduce Vision2Code, a reference-code-free benchmark and evaluation framework for multi-domain image-to-code generation. Vision2Code contains 2,169 test examples from 15 source datasets that span charts and plots, geometry, graphs, scientific imagery, documents, and 3D spatial scenes. Models generate executable programs, which we render and score against the source image using a VLM rater with dataset-specific rubrics and deterministic guardrails for severe semantic failures. We report render-success diagnostics that separate code execution failures from…
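The render-then-score loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual implementation: the names `render_program`, `evaluate`, `EvalResult`, the `out.png` output convention, the 0 to 1 score scale, and the idea that a guardrail caps the score are all assumptions made for the example. The rater and guardrail are passed in as plain callables because the source does not specify their interfaces.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional

# Hypothetical signatures: (source_image_bytes, rendered_image_bytes) -> result.
Rater = Callable[[bytes, bytes], float]
Guardrail = Callable[[bytes, bytes], bool]


@dataclass
class EvalResult:
    rendered: bool           # did the generated program run and emit an image?
    score: Optional[float]   # None when rendering failed, so the two cases stay separate
    note: str


def render_program(code: str, out_name: str = "out.png",
                   timeout_s: float = 30.0) -> Optional[bytes]:
    """Run generated code in a scratch directory and return the image it wrote.

    Returns None on a crash, a timeout, or a missing output file.
    Note: a real harness would need proper sandboxing for untrusted code.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "program.py"
        script.write_text(code)
        try:
            proc = subprocess.run([sys.executable, str(script)], cwd=tmp,
                                  capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return None
        out = Path(tmp) / out_name
        if proc.returncode != 0 or not out.exists():
            return None
        return out.read_bytes()


def evaluate(code: str, source_image: bytes, rater: Rater,
             guardrail: Guardrail, cap: float = 0.2) -> EvalResult:
    """Render the program, then score the render against the source image."""
    rendered = render_program(code)
    if rendered is None:
        # Execution failure is reported separately from a low visual score.
        return EvalResult(False, None, "render failed")
    score = rater(source_image, rendered)
    if guardrail(source_image, rendered):
        # Assumption: a severe semantic failure caps the score deterministically.
        return EvalResult(True, min(score, cap), "guardrail triggered")
    return EvalResult(True, score, "ok")
```

Returning `score=None` on a render failure, rather than zero, keeps execution failures distinguishable from faithful-but-poor reconstructions when results are aggregated.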
