SVGFusion: A VAE-Diffusion Transformer for Vector Graphic Generation
Ximing Xing, Juncheng Hu, Ziteng Xue, Jing Zhang, Buyu Li, Sheng Wang, Dong Xu, Qian Yu

TL;DR
SVGFusion is a VAE-diffusion transformer framework that improves the quality and editability of SVGs generated from text. It targets the poor structural understanding and error accumulation of LLM-based models, and the slow, uneditable outputs of optimization-based methods.
Contribution
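The paper presents SVGFusion, a unified VAE-diffusion architecture for high-quality, editable SVG generation from text. It combines a latent space learned jointly from SVG code and its rendered image (VP-VAE), a diffusion transformer operating in that space (VS-DiT), and a Rendering Sequence Modeling strategy.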
Findings
Achieves state-of-the-art results on SVGX-Dataset (240k SVGs).
Generated SVGs are high-quality, semantically aligned with the input text, and editable.
The model outperforms existing LLM-based and optimization-based methods.
Abstract
Generating high-quality Scalable Vector Graphics (SVGs) from text remains a significant challenge. Existing LLM-based models that generate SVG code as a flat token sequence struggle with poor structural understanding and error accumulation, while optimization-based methods are slow and yield uneditable outputs. To address these limitations, we introduce SVGFusion, a unified framework that adapts the VAE-diffusion architecture to bridge the dual code-visual nature of SVGs. Our model features two core components: a Vector-Pixel Fusion Variational Autoencoder (VP-VAE) that learns a perceptually rich latent space by jointly encoding SVG code and its rendered image, and a Vector Space Diffusion Transformer (VS-DiT) that achieves globally coherent compositions through iterative refinement. Furthermore, this architecture is enhanced by a Rendering Sequence Modeling strategy, which ensures…
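The two-stage flow described in the abstract (encode SVG code and its rendering into one latent, then generate latents by iterative refinement) can be sketched loosely as follows. This is a toy illustration, not the authors' implementation: the function names, the feature values, and the oracle denoiser are all hypothetical, concatenation merely stands in for the learned VP-VAE encoder, and the refinement loop is a simplified stand-in for VS-DiT sampling. Only the reparameterization and forward-noising formulas are the standard VAE and diffusion ones.

```python
import math
import random

def fuse(vector_feats, pixel_feats):
    # Toy stand-in for joint encoding: concatenate the two feature lists.
    return list(vector_feats) + list(pixel_feats)

def reparameterize(mu, log_var, rng):
    # Standard VAE sampling: z = mu + sigma * eps, with sigma = exp(0.5 * log_var).
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def add_noise(z0, alpha_bar, rng):
    # Standard forward diffusion step: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps.
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [a * z + b * rng.gauss(0.0, 1.0) for z in z0]

def refine(z_t, predict_clean, steps):
    # Toy iterative refinement (an assumption, not the paper's sampler):
    # each step moves part of the way toward the denoiser's estimate of the
    # clean latent, and the final step (k == 1) lands on that estimate.
    z = list(z_t)
    for k in range(steps, 0, -1):
        target = predict_clean(z, k)
        z = [zi + (ti - zi) / k for zi, ti in zip(z, target)]
    return z

rng = random.Random(0)
vector_feats = [0.2, -0.1]  # hypothetical features from SVG path commands
pixel_feats = [0.7, 0.4]    # hypothetical features from the rendered image
mu = fuse(vector_feats, pixel_feats)
z0 = reparameterize(mu, [-20.0] * len(mu), rng)  # tiny variance keeps z0 near mu
z_T = add_noise(z0, 0.0, rng)                    # alpha_bar = 0 yields pure noise
oracle = lambda z, k: z0                         # stands in for a trained denoiser
z_hat = refine(z_T, oracle, steps=10)
```

In a real system the oracle would be replaced by a trained, text-conditioned network, and a decoder would map the refined latent back to SVG commands; neither is modeled here.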
