Qwen-Image-VAE-2.0 Technical Report

Zekai Zhang; Deqing Li; Kuan Cao; Yujia Wu; Chenfei Wu; Yu Wu; Liang Peng; Hao Meng; Jiahao Li; Jie Zhang; Kaiyuan Gao; Kun Yan; Lihan Jiang; Ningyuan Tang; Shengming Yin; Tianhe Wu; Xiao Xu; Xiaoyue Chen; Yan Shu; Yanran Zhang; Yilei Chen; Yixian Xu; Yuxiang Chen; Zhendong Wang; Zihao Liu; Zikai Zhou; Yiliang Gu; Yi Wang; Xiaoxiao Xu; Lin Qu

arXiv:2605.13565·cs.CV·May 14, 2026

TL;DR

Qwen-Image-VAE-2.0 is a suite of high-compression VAEs that combines an improved architecture (Global Skip Connections and expanded latent channels), training on billions of images, and a new benchmark. It reports state-of-the-art reconstruction and diffusability, especially in text-rich scenarios.

Contribution

The paper contributes architectural enhancements, training scaled to billions of images, and a new benchmark, which together advance high-compression image reconstruction and diffusability.

Findings

01. Achieves state-of-the-art reconstruction performance.
02. Excels in text-rich scenarios with high compression.
03. Demonstrates superior diffusability and faster convergence.

Abstract

We present Qwen-Image-VAE-2.0, a suite of high-compression Variational Autoencoders (VAEs) that achieve significant advances in both reconstruction fidelity and diffusability. To address the reconstruction bottlenecks of high compression, we adopt an improved architecture featuring Global Skip Connections (GSC) and expanded latent channels. Moreover, we scale training to billions of images and incorporate a synthetic rendering engine to improve performance in text-rich scenarios. To tackle the convergence challenges of high-dimensional latent space, we implement an enhanced semantic alignment strategy to make the latent space highly amenable to diffusion modeling. To optimize computational efficiency, we leverage an asymmetric and attention-free encoder-decoder backbone to minimize encoding overhead. We present a comprehensive evaluation of Qwen-Image-VAE-2.0 on public reconstruction…
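
To make the trade-off between high compression and expanded latent channels concrete, the short sketch below computes the latent shape and the input-to-latent compression ratio for an RGB image, using the standard relation ratio = 3 * f^2 / c for a spatial downsampling factor f and c latent channels. The function names and the (f, c) pairs are hypothetical and chosen only for illustration; they are not the configurations reported for Qwen-Image-VAE-2.0.

```python
def latent_shape(height, width, downsample, latent_channels):
    """Return the latent shape (C, H/f, W/f) for an image of size height x width."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(downsample, latent_channels, image_channels=3):
    """Ratio of input values to latent values: (image_channels * f^2) / c."""
    return image_channels * downsample ** 2 / latent_channels


# Hypothetical (downsampling factor, latent channels) pairs, for illustration only.
for f, c in [(8, 16), (16, 64), (16, 128)]:
    shape = latent_shape(1024, 1024, f, c)
    print(f"f={f}, c={c}: latent shape {shape}, ratio {compression_ratio(f, c)}")
```

With these made-up settings, moving from f=8 to f=16 while quadrupling the channels from 16 to 64 leaves the ratio unchanged at 12, whereas widening further to 128 channels halves it to 6. More latent channels give the decoder more information to reconstruct from, at the cost of a higher-dimensional latent space for the diffusion model.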

Code & Models

Repositories

alibaba/OmniDoc-TokenBench
github

Datasets

alibabagroup/OmniDoc-TokenBench
dataset · 2.1k dl
