SVGFusion: A VAE-Diffusion Transformer for Vector Graphic Generation

Ximing Xing; Juncheng Hu; Ziteng Xue; Jing Zhang; Buyu Li; Sheng Wang; Dong Xu; Qian Yu

arXiv:2412.10437 · cs.CV · April 10, 2026

TL;DR

SVGFusion is a VAE-diffusion transformer framework for text-to-SVG generation. It improves the quality and editability of generated SVGs, addressing the poor structural understanding and error accumulation of LLM-based generators and the slow, uneditable outputs of optimization-based methods.

Contribution

The paper presents SVGFusion, a unified VAE-diffusion architecture with a novel latent space and rendering sequence modeling for high-quality, editable SVG generation from text.

Findings

01 Achieves state-of-the-art results on SVGX-Dataset, a collection of 240k SVGs.

02 Generates SVGs that are high-quality, semantically aligned with the text prompt, and editable.

03 Outperforms existing LLM-based and optimization-based methods.

Abstract

Generating high-quality Scalable Vector Graphics (SVGs) from text remains a significant challenge. Existing LLM-based models that generate SVG code as a flat token sequence struggle with poor structural understanding and error accumulation, while optimization-based methods are slow and yield uneditable outputs. To address these limitations, we introduce SVGFusion, a unified framework that adapts the VAE-diffusion architecture to bridge the dual code-visual nature of SVGs. Our model features two core components: a Vector-Pixel Fusion Variational Autoencoder (VP-VAE) that learns a perceptually rich latent space by jointly encoding SVG code and its rendered image, and a Vector Space Diffusion Transformer (VS-DiT) that achieves globally coherent compositions through iterative refinement. Furthermore, this architecture is enhanced by a Rendering Sequence Modeling strategy, which ensures…
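The two-stage idea in the abstract (encode code and image jointly into a latent, then refine a noisy latent iteratively) can be illustrated with a toy sketch. This is not the paper's VP-VAE or VS-DiT: `encode_joint`, `ddim_sample`, and `oracle_noise` are hypothetical names, the encoder is a random linear map, the feature vectors are random placeholders, and an oracle noise predictor stands in for a trained transformer so the loop's behavior can be checked. The sampler is a standard deterministic DDIM-style update, chosen here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_joint(vector_feats, pixel_feats, w_mu, w_logvar, rng):
    """Toy joint encoder (hypothetical): fuse SVG-code and image features,
    then sample a latent with the VAE reparameterization trick."""
    fused = np.concatenate([vector_feats, pixel_feats])
    mu = w_mu @ fused
    logvar = w_logvar @ fused
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps, mu, logvar

def ddim_sample(z_T, predict_noise, alpha_bar):
    """Deterministic DDIM-style refinement from a noisy latent z_T.
    alpha_bar[t] is the cumulative signal level at step t (decreasing in t)."""
    z = z_T
    for t in range(len(alpha_bar) - 1, -1, -1):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[t - 1] if t > 0 else 1.0
        eps_hat = predict_noise(z, t)
        # Estimate the clean latent, then step to the previous noise level.
        z0_hat = (z - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
        z = np.sqrt(a_prev) * z0_hat + np.sqrt(1.0 - a_prev) * eps_hat
    return z

# Placeholder features standing in for embedded SVG commands and a rendered image.
latent_dim, vec_dim, pix_dim = 8, 16, 32
vector_feats = rng.standard_normal(vec_dim)
pixel_feats = rng.standard_normal(pix_dim)
w_mu = 0.1 * rng.standard_normal((latent_dim, vec_dim + pix_dim))
w_logvar = 0.1 * rng.standard_normal((latent_dim, vec_dim + pix_dim))
z0, mu, logvar = encode_joint(vector_feats, pixel_feats, w_mu, w_logvar, rng)

alpha_bar = np.linspace(0.999, 0.01, 50)  # assumed schedule, not from the paper

def oracle_noise(z_t, t):
    """Oracle that knows z0; a trained, text-conditioned denoiser would go here."""
    return (z_t - np.sqrt(alpha_bar[t]) * z0) / np.sqrt(1.0 - alpha_bar[t])

z_T = rng.standard_normal(latent_dim)
z_final = ddim_sample(z_T, oracle_noise, alpha_bar)
```

With the oracle predictor, the loop returns the encoded latent `z0` up to floating-point error, which makes the update rule easy to verify. In a real system the predictor would be a learned network conditioned on the text prompt, and a decoder would map the final latent back to SVG code.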
