Weak Supervision Dynamic KL-Weighted Diffusion Models Guided by Large Language Models
Julian Perry, Frank Sanders, Carter Scott

TL;DR
This paper introduces a hybrid text-to-image generation method combining large language models with diffusion models, utilizing a dynamic KL-weighting strategy to enhance image quality, relevance, and training stability.
Contribution
It presents a novel dynamic KL-weighting technique and integrates semantic guidance from LLMs to improve diffusion-based image synthesis from text.
Findings
Outperforms traditional GANs in image quality and relevance
Enhances training stability and robustness to textual variability
Demonstrates scalability to other multimodal tasks
Abstract
In this paper, we present a novel method for improving text-to-image generation by combining Large Language Models (LLMs) with diffusion models, a hybrid approach aimed at achieving both higher quality and greater efficiency in image synthesis from text descriptions. Our approach introduces a new dynamic KL-weighting strategy to optimize the diffusion process and incorporates semantic understanding from pre-trained LLMs to guide generation. The proposed method significantly improves both the visual quality of generated images and their alignment with text descriptions, addressing challenges such as computational inefficiency, training instability, and sensitivity to textual variability. We evaluate our method on the COCO dataset and demonstrate its superior performance over traditional GAN-based models, both quantitatively and qualitatively. Extensive experiments, including…
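The abstract names a dynamic KL-weighting strategy but this summary does not say how the weight is computed. As a rough illustration only, the sketch below assumes the weight follows a cosine ramp over training steps and scales the summed KL terms of a diffusion-style objective. The function names, the schedule shape, and the `w_min`/`w_max` defaults are all hypothetical and are not taken from the paper.

```python
import math


def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """Closed-form KL(q || p) between two univariate Gaussians."""
    return 0.5 * (math.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)


def dynamic_kl_weight(step, total_steps, w_min=0.1, w_max=1.0):
    """Hypothetical schedule: cosine ramp from w_min to w_max over training.

    This is an assumed form, not the schedule used in the paper.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return w_min + (w_max - w_min) * 0.5 * (1.0 - math.cos(math.pi * progress))


def weighted_objective(recon_loss, kl_terms, step, total_steps):
    """Reconstruction loss plus the dynamically weighted sum of KL terms."""
    weight = dynamic_kl_weight(step, total_steps)
    return recon_loss + weight * sum(kl_terms)


if __name__ == "__main__":
    # Toy per-timestep KL terms between made-up posterior/prior Gaussians.
    kl_terms = [gaussian_kl(0.2, 0.9, 0.0, 1.0), gaussian_kl(-0.1, 1.1, 0.0, 1.0)]
    for step in (0, 500, 1000):
        print(step, weighted_objective(1.0, kl_terms, step, total_steps=1000))
```

Starting with a small weight and increasing it is one common way to keep an early training signal dominated by reconstruction; whether the paper's strategy behaves this way is not stated here.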
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling
