WildSVG: Towards Reliable SVG Generation Under Real-World Conditions
Marco Terral, Haotian Zhang, Tianyang Zhang, Meng Lin, Xiaoqing Xie, Haoran Dai, Darsh Kaushik, Pai Peng, Nicklas Scharpff, David Vazquez, Joan Rodriguez

TL;DR
This paper introduces WildSVG, a new benchmark dataset for evaluating SVG extraction from real-world images, revealing current models' limitations and suggesting iterative refinement as a promising improvement path.
Contribution
The paper presents WildSVG, the first benchmark for SVG extraction in real-world conditions, and evaluates existing models, highlighting their shortcomings and potential directions for enhancement.
Findings
Current models underperform on real-world SVG extraction tasks.
Iterative refinement methods show promise for improving accuracy.
WildSVG provides a foundation for systematic benchmarking.
Abstract
We introduce the task of SVG extraction, which consists in translating specific visual inputs from an image into scalable vector graphics. Existing multimodal models achieve strong results when generating SVGs from clean renderings or textual descriptions, but they fall short in real-world scenarios where natural images introduce noise, clutter, and domain shifts. A central challenge in this direction is the lack of suitable benchmarks. To address this need, we introduce the WildSVG Benchmark, formed by two complementary datasets: Natural WildSVG, built from real images containing company logos paired with their SVG annotations, and Synthetic WildSVG, which blends complex SVG renderings into real scenes to simulate difficult conditions. Together, these resources provide the first foundation for systematic benchmarking of SVG extraction. We benchmark state-of-the-art multimodal models and…
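As a rough illustration of the Synthetic WildSVG idea (blending an SVG rendering into a real scene), the sketch below alpha-composites a rasterized overlay onto a background image. It is not the authors' pipeline: the function name `composite`, the nested-list image representation, and the placement arguments `top` and `left` are all hypothetical, and the sketch assumes the SVG has already been rasterized to RGBA values in [0, 1].

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]           # RGB, each channel in [0, 1]
PixelA = Tuple[float, float, float, float]   # RGBA, each channel in [0, 1]


def composite(scene: List[List[Pixel]],
              overlay: List[List[PixelA]],
              top: int, left: int) -> List[List[Pixel]]:
    """Alpha-blend `overlay` onto a copy of `scene` at row `top`, column `left`.

    Hypothetical helper for illustration only; overlay pixels that fall
    outside the scene are skipped.
    """
    out = [list(row) for row in scene]  # copy rows so the input stays unchanged
    for dy, row in enumerate(overlay):
        for dx, (r, g, b, a) in enumerate(row):
            y, x = top + dy, left + dx
            if 0 <= y < len(out) and 0 <= x < len(out[y]):
                br, bg, bb = out[y][x]
                # Standard "over" blend: result = a * foreground + (1 - a) * background
                out[y][x] = (a * r + (1 - a) * br,
                             a * g + (1 - a) * bg,
                             a * b + (1 - a) * bb)
    return out
```

Varying the overlay's position, scale, and opacity is one plausible way to produce the "difficult conditions" the abstract mentions, though the actual generation procedure is not described in this summary.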
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1. The authors evaluate a wide range of state-of-the-art vision-language models (Qwen, Gemini, Claude, GPT, StarVector, GLM) on both Natural and Synthetic WildSVG test sets. 2. They use four complementary metrics (L2, SSIM for pixel fidelity; LPIPS, DINO for perceptual/semantic similarity), which is an appropriate choice to capture different aspects of the generated SVG.
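To make the pixel-fidelity metrics concrete, here is a minimal sketch of L2 (computed here as mean squared error, which is an assumption about the paper's exact definition) and a simplified SSIM. The SSIM below is computed over the whole image as a single window, whereas standard implementations average over local windows, so its values will not match library outputs. Images are assumed to be flattened grayscale sequences in [0, 1]; the function names are illustrative. LPIPS and DINO similarity require pretrained networks and are not sketched.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized flattened images."""
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and the same size")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def global_ssim(a: Sequence[float], b: Sequence[float],
                data_range: float = 1.0) -> float:
    """Single-window SSIM over the whole image (a simplification)."""
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and the same size")
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    # Stabilizing constants from the original SSIM formulation
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
```

Lower is better for L2 and higher is better for SSIM. Both compare rasterized outputs only, so they say nothing about how the SVG was constructed.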
1. How often did the VLLM-based SVG matching fail or produce incorrect logo–SVG pairs in Natural WildSVG? Were any manual checks done, and how sensitive are the results to these mismatches? 2. Can you clarify how the “focus prompt” is formulated and used? If the prompt is ambiguous or generic, how does it affect the model’s output? 3. Can you provide more details on the synthetic data creation? How diverse are the embedded SVG contexts (lighting, occlusion, styles)? 4. How do you ensure that
1. The claims are well backed by experiments. Using both one-step and two-step evaluation settings helps disentangle localization from vectorization capabilities. The choice of metrics, covering both pixel-level and semantic similarity, is comprehensive.
1. The presentation of the paper is extremely poor, and it looks as if it was written in a hurry. 2. The test sets are worryingly small. This raises serious questions about the statistical significance of the reported results and the reliability of the benchmark for distinguishing between top-performing models where score differences are marginal. 3. The authors astutely identify that most VLLMs cheat by using SVG text primitives instead of drawing shapes. However, the chosen raster-based metrics (DINO,
1. This paper presents the first benchmark dedicated to SVG extraction, comprising Natural WildSVG, focusing on real-world images paired with verified SVG annotations, and Synthetic WildSVG, focusing on natural images with synthetically embedded, complex SVGs. 2. They devise new evaluation protocols and multi-metric analysis. In addition, they benchmark several VLMs on this task. 3. The paper outlines clear future directions and potential integration with multimodal LLM pipelines, encouraging further
1. From my perspective, the proposed task lacks sufficient innovation. 2. Limited benchmark diversity: Expansion to more diverse SVG types (beyond logos, to include pictograms, diagrams, and UI elements) could clarify task boundaries and model strengths. 3. Lack of editing-based SVG extraction: It is possible to first use image editing to extract the target object into a raster image, then vectorize the raster image into SVG. 4. More robust metrics: this paper only considers visual similarity as the metri
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques
