High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding

Ji Woo Hong; Hee Suk Yoon; Gwanhyeong Koo; Eunseop Yoon; SooHwan Eom; Qi Dai; Chong Luo; Chang D. Yoo

arXiv:2603.13389 · cs.CV · March 17, 2026

Open Access

TL;DR

This paper introduces a diffusion-based decoding framework that improves the visual fidelity of text-to-image generation from pre-trained vision-language models (VLMs). Only a diffusion decoder is trained, on the VLM's output image-token logits, so the VLM itself is left unchanged.

Contribution

The paper proposes a distribution-conditioned diffusion decoding method that improves image quality while leaving the original VLM intact; the decoder requires only short training on ImageNet-1K.

Findings

01. Improves visual fidelity of VLM-based image generation

02. Achieves high-quality images with minimal additional training

03. Enhances both VQ-VAE reconstructions and text-to-image outputs

Abstract

Recent large-scale vision-language models (VLMs) have shown remarkable text-to-image generation capabilities, yet their visual fidelity remains constrained by discrete image tokenization, which poses a major challenge. Although several studies have explored continuous representation modeling to enhance visual quality, adapting pre-trained VLMs to such representations requires large-scale data and training costs comparable to the original pre-training. To circumvent this limitation, we propose a diffusion-based decoding framework that enhances image fidelity by training only a diffusion decoder on the output image-token logits of pre-trained VLMs, thereby keeping the original model intact. At its core, Logit-to-Code Distributional Mapping converts the VLM's image-token logits into continuous, distribution-weighted code vectors with uncertainty features, providing an…
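
As a rough illustration of what "distribution-weighted code vectors with uncertainty features" could look like, the sketch below takes a softmax over per-token logits, forms the probability-weighted average of codebook entries, and appends a normalized entropy value as one possible uncertainty feature. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `logits_to_soft_codes`, the `temperature` parameter, and the choice of entropy as the uncertainty feature are all hypothetical, since the abstract does not specify them.

```python
import numpy as np

def logits_to_soft_codes(logits, codebook, temperature=1.0):
    """Hypothetical sketch: map token logits (N, K) and a codebook (K, D)
    to per-token features of shape (N, D + 1)."""
    # Numerically stable softmax over the K codebook entries.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Distribution-weighted code vector: expected codebook entry under probs.
    soft_codes = probs @ codebook  # shape (N, D)

    # Assumed uncertainty feature: entropy normalized by log(K), roughly in [0, 1].
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1, keepdims=True)
    entropy /= np.log(codebook.shape[0])

    return np.concatenate([soft_codes, entropy], axis=-1)
```

Under these assumptions, a sharply peaked distribution yields approximately the single selected codebook vector with entropy near 0, which matches ordinary discrete decoding, while a flat distribution yields the codebook mean with entropy near 1. The continuous features therefore carry information that a hard argmax over tokens would discard.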

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning