TL;DR
RoboSVG is a multimodal framework that generates high-quality, interactive SVGs under textual, visual, and numerical guidance. It is supported by the large-scale RoboDraw dataset and is reported to outperform existing methods in versatility and accuracy.
Contribution
The paper introduces RoboSVG, a novel multimodal framework for SVG generation, and presents RoboDraw, a large-scale dataset for training and evaluating such models.
Findings
RoboSVG achieves higher query compliance and visual fidelity than the compared baselines.
The framework effectively integrates textual, visual, and numerical guidance.
RoboDraw enables systematic study of diverse SVG generation tasks.
Abstract
Scalable Vector Graphics (SVGs) are fundamental to digital design and robot control, encoding not only visual structure but also motion paths in interactive drawings. In this work, we introduce RoboSVG, a unified multimodal framework for generating interactive SVGs guided by textual, visual, and numerical signals. Given an input query, the RoboSVG model first produces multimodal guidance, then synthesizes candidate SVGs through dedicated generation modules, and finally refines them under numerical guidance to yield high-quality outputs. To support this framework, we construct RoboDraw, a large-scale dataset of one million examples, each pairing an SVG generation condition (e.g., text, image, and partial SVG) with its corresponding ground-truth SVG code. The RoboDraw dataset enables systematic study of four tasks, including basic generation (Text-to-SVG, Image-to-SVG) and interactive…
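The abstract's last pipeline stage, picking among candidate SVGs under numerical guidance, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`is_well_formed_svg`, `select_best`, `toy_score`) are hypothetical, and the scoring rule is a placeholder assumption standing in for whatever numerical signal RoboSVG actually uses.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Iterable, Optional


def is_well_formed_svg(svg_code: str) -> bool:
    """Return True if the string parses as XML with an <svg> root element."""
    try:
        root = ET.fromstring(svg_code)
    except ET.ParseError:
        return False
    # The tag may carry a namespace, e.g. "{http://www.w3.org/2000/svg}svg".
    return root.tag.rsplit("}", 1)[-1] == "svg"


def select_best(
    candidates: Iterable[str],
    score_fn: Callable[[str], float],
) -> Optional[str]:
    """Drop candidates that fail to parse, then keep the highest-scoring one."""
    valid = [c for c in candidates if is_well_formed_svg(c)]
    if not valid:
        return None
    return max(valid, key=score_fn)


def toy_score(svg_code: str) -> float:
    """Placeholder score (an assumption): prefer SVGs with fewer elements."""
    root = ET.fromstring(svg_code)
    return -float(sum(1 for _ in root.iter()))


candidates = [
    '<svg xmlns="http://www.w3.org/2000/svg"><circle r="1"/></svg>',
    "<svg><rect/><rect/></svg>",
    "<svg><path",  # truncated output, rejected by the well-formedness check
]
best = select_best(candidates, toy_score)
```

In a real system the score would more plausibly compare a rendered candidate against the query or a reference image; the element count is used here only to keep the sketch dependency-free.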
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper is clearly written. The technical contributions and key ideas are easy to follow.
- The authors present comprehensive analyses across diverse SVG generation tasks, including different conditioning inputs and cases where partial SVGs or input images are present or absent.
- The user study provides useful evidence supporting the superior performance of the proposed method.
- While the paper reads somewhat like a positioning paper introducing the interactive SVG generation task, its novelty is limited: (1) SVG generation has already been studied in prior work, (2) the dataset is a curated version of existing datasets rather than newly collected, and (3) the method relies heavily on existing foundation models.
- The paper lacks clear technical novelty or methodological contributions. While such positioning papers may fit better in NLP venues, I think ICLR generally
1. The paper is very well presented, with attention to fine details.
2. The paper clearly defines a practical and useful problem, interactive SVG generation, which is a logical step beyond one-shot static generation.
3. Experiments show that the system achieves strong empirical results, consistently outperforming strong zero-shot baselines and existing specialized models on the new benchmark.
1. The paper presents a system, not a novel method. It reads more like an engineering system than a well-defined mathematical formulation.
2. The main baselines, GPT-4o and Qwen-72B, are run in a zero-shot setting, whereas RoboSVG is fine-tuned on 1M samples from the RoboDraw dataset. This is an apples-to-oranges comparison: the specialized, fine-tuned model will win in any scenario here.
1. Introduces interactive SVG generation tasks (PartialSVG-to-SVG, PartialImage-to-SVG), which are novel and practical.
2. RoboDraw dataset enables systematic study of these tasks.
3. First comprehensive benchmark comparing MLLMs and specialized models on SVG generation.
1. Limited technical novelty: the approach primarily combines existing components (Qwen-2.5-VL backbone, FLUX.1 for guidance, etc.).
2. RoboDraw is constructed from existing datasets (MMSVG-2M, SVGX) with filtering and processing, not entirely original data collection.
3. The "unified framework" is essentially task-specific modules with candidate selection, which is relatively straightforward.
4. No significant algorithmic innovations beyond engineering existing techniques.
1. This work provides a clear and precise definition of the proposed interactive generation framework.
2. It explicitly defines the different task variants (e.g., PartialSVG-to-SVG and PartialImage-to-SVG), enhancing clarity and reproducibility.
3. The method is evaluated from multiple perspectives and rigorously compared against several prior approaches, demonstrating its effectiveness and versatility.
My main concerns relate to the experimental design and the fairness of comparisons:
1. **Ensemble fairness**: Your method uses three SVG generation modules and selects the best output via numerical guidance. Have you evaluated whether baseline models (e.g., GPT-4o, Qwen2.5-VL) also benefit from multi-trial generation (e.g., running 2–3 times with different seeds and selecting the best)? Without this, the comparison may be biased in favor of your pipeline.
2. **Two-stage generation for partial ta
