High Fidelity Text to Image Generation with Contrastive Alignment and Structural Guidance
Danyi Gao

TL;DR
This paper introduces a novel high-fidelity text-to-image generation method that combines contrastive learning for semantic alignment with structural guidance to improve image quality and structural accuracy.
Contribution
It proposes a joint framework integrating contrastive constraints and structural priors, enhancing semantic matching and structural fidelity in generated images.
Findings
Outperforms existing methods on COCO-2014 in CLIP Score, FID, and SSIM.
Effectively balances semantic alignment and structural fidelity without added computational cost.
Demonstrates improved controllability and detail in generated images.
Abstract
This paper addresses the performance bottlenecks of existing text-driven image generation methods in terms of semantic alignment accuracy and structural consistency. A high-fidelity image generation method is proposed by integrating text-image contrastive constraints with structural guidance mechanisms. The approach introduces a contrastive learning module that builds strong cross-modal alignment constraints to improve semantic matching between text and image. At the same time, structural priors such as semantic layout maps or edge sketches are used to guide the generator in spatial-level structural modeling. This enhances the layout completeness and detail fidelity of the generated images. Within the overall framework, the model jointly optimizes contrastive loss, structural consistency loss, and semantic preservation loss. A multi-objective supervision mechanism is adopted to improve…
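As a rough illustration of the joint objective described in the abstract, the sketch below combines a contrastive term, a structural consistency term, and a semantic preservation term into one weighted sum. The abstract does not give the loss formulas, so everything concrete here is an assumption: the symmetric InfoNCE form of the contrastive loss, the L1 distance on structure maps, the cosine-distance form of semantic preservation, the temperature, the weights, and all function names are hypothetical choices made for the example, not the paper's definitions.

```python
import numpy as np


def info_nce(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired text/image embeddings.

    Assumed form of the contrastive loss; row i of each array is a matched pair.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature  # (B, B), matched pairs on the diagonal

    def diag_cross_entropy(z):
        z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the text-to-image and image-to-text directions.
    return float(0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T)))


def structural_l1(pred_structure, target_structure):
    """Assumed structural consistency term: mean absolute error between a
    structure map extracted from the generated image and the guiding prior
    (for example an edge sketch or a layout map)."""
    return float(np.mean(np.abs(pred_structure - target_structure)))


def semantic_preservation(gen_feat, ref_feat):
    """Assumed semantic preservation term: mean cosine distance between
    features of the generated image and reference features."""
    g = gen_feat / np.linalg.norm(gen_feat, axis=1, keepdims=True)
    r = ref_feat / np.linalg.norm(ref_feat, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(g * r, axis=1)))


def joint_loss(text_emb, image_emb, pred_structure, target_structure,
               gen_feat, ref_feat, w_con=1.0, w_struct=1.0, w_sem=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_con * info_nce(text_emb, image_emb)
            + w_struct * structural_l1(pred_structure, target_structure)
            + w_sem * semantic_preservation(gen_feat, ref_feat))
```

In a real training loop these terms would be computed on framework tensors so that gradients flow back to the generator; NumPy is used here only to keep the arithmetic easy to read.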
