Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models

Emily Johnson; Noah Wilson

arXiv:2501.00917·cs.CV·January 3, 2025

Open Access

TL;DR

This paper presents VLAD, a hierarchical diffusion model that improves text-to-image generation by better aligning complex textual descriptions with high-quality images through semantic decomposition and multi-stage diffusion.

Contribution

Introduces VLAD, a novel hierarchical diffusion framework with semantic alignment modules for enhanced text-to-image synthesis performance.

Findings

01. VLAD outperforms state-of-the-art methods on the MARIO-Eval and INNOVATOR-Eval benchmarks.

02. In experiments, VLAD achieves higher image quality and stronger semantic alignment than the compared methods.

03. Human evaluators prefer VLAD's generated images over those of competing methods.

Abstract

Text-to-image generation has witnessed significant advancements with the integration of Large Vision-Language Models (LVLMs), yet challenges remain in aligning complex textual descriptions with high-quality, visually coherent images. This paper introduces the Vision-Language Aligned Diffusion (VLAD) model, a generative framework that addresses these challenges through a dual-stream strategy combining semantic alignment and hierarchical diffusion. VLAD utilizes a Contextual Composition Module (CCM) to decompose textual prompts into global and local representations, ensuring precise alignment with visual features. Furthermore, it incorporates a multi-stage diffusion process with hierarchical guidance to generate high-fidelity images. Experiments conducted on MARIO-Eval and INNOVATOR-Eval benchmarks demonstrate that VLAD significantly outperforms state-of-the-art methods in terms of image…
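The abstract does not say how the CCM splits a prompt or how the hierarchical guidance is scheduled, so the following is only a toy sketch of the general idea: separate a prompt into one global description and several local phrases, then shift the emphasis from global to local conditioning as the stages progress. The names `PromptParts`, `decompose_prompt`, and `guidance_weights`, the regex-based splitting, and the linear schedule are all assumptions made for illustration and are not taken from the paper.

```python
import re
from dataclasses import dataclass


@dataclass
class PromptParts:
    global_text: str        # the full prompt, standing in for scene-level context
    local_texts: list[str]  # clause-level phrases, standing in for object details


def decompose_prompt(prompt: str) -> PromptParts:
    """Hypothetical stand-in for prompt decomposition.

    Splits on commas and the words 'and' / 'with'. A learned module
    would operate on embeddings rather than on raw strings.
    """
    pieces = re.split(r",|\band\b|\bwith\b", prompt)
    local = [p.strip() for p in pieces if p.strip()]
    return PromptParts(global_text=prompt.strip(), local_texts=local)


def guidance_weights(stage: int, num_stages: int) -> tuple[float, float]:
    """Hypothetical coarse-to-fine schedule (assumed, not from the paper).

    Returns (global_weight, local_weight): the first stage relies only on
    the global text, the last stage only on the local phrases, with a
    linear blend in between.
    """
    if num_stages < 1 or not 0 <= stage < num_stages:
        raise ValueError("stage must satisfy 0 <= stage < num_stages")
    t = stage / (num_stages - 1) if num_stages > 1 else 0.0
    return (1.0 - t, t)


if __name__ == "__main__":
    parts = decompose_prompt("a red fox, a snowy forest and a wooden cabin")
    print(parts.local_texts)
    for s in range(3):
        print(s, guidance_weights(s, 3))
```

In a real system the two weights would scale conditioning signals inside the denoising network; here they only show how a stage index could trade off global against local guidance.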

Taxonomy

Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Digital Humanities and Scholarship

Methods: Diffusion