SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation
Yuhan Pei, Ruoyu Wang, Yongqi Yang, Ye Zhu, Olga Russakovsky, Yu Wu

TL;DR
This paper introduces SOW, a diffusion-based method that uses multimodal large language models (MLLMs) to improve contextual coherence and detail preservation in text-vision-to-image (TV2I) generation by curbing undesired interference between image regions.
Contribution
The paper proposes two frameworks, Cyclic One-Way Diffusion (COW) and SOW, that use MLLMs to control how information diffuses between image regions, improving semantic and spatial coherence in image generation without additional training.
Findings
SOW achieves pixel-level condition fidelity in image generation.
Controlled diffusion improves semantic and visual coherence.
Experiments demonstrate the effectiveness of the proposed methods.
Abstract
Originating from the diffusion phenomenon in physics, which describes the random movement and collisions of particles, diffusion generative models simulate a random walk in the data space along the denoising trajectory. This allows information to diffuse across regions, yielding harmonious outcomes. However, the chaotic and disordered nature of information diffusion in diffusion models often results in undesired interference between image regions, causing degraded detail preservation and contextual inconsistency. In this work, we address these challenges by reframing disordered diffusion as a powerful tool for text-vision-to-image generation (TV2I) tasks, achieving pixel-level condition fidelity while maintaining visual and semantic coherence throughout the image. We first introduce Cyclic One-Way Diffusion (COW), which provides an efficient unidirectional diffusion framework for…
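The idea of letting information flow in one direction only, from a given visual condition out to the rest of the image, can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' COW or SOW procedure, whose details are cut off in the abstract above. The function name `one_way_denoise_sketch`, the linear noise schedule, and the neighbour-averaging stand-in for a denoiser are all hypothetical choices made for illustration.

```python
import numpy as np

def one_way_denoise_sketch(condition, mask, steps=50, seed=0):
    """Toy loop: the masked region is re-imposed from `condition` at every step.

    Hypothetical illustration only; no real diffusion model is involved.
    """
    rng = np.random.default_rng(seed)
    # Assumed linear noise schedule; alpha_bar[t] is the cumulative signal fraction.
    betas = np.linspace(1e-4, 0.02, steps)
    alpha_bar = np.cumprod(1.0 - betas)

    x = rng.standard_normal(condition.shape)  # start from pure noise
    for t in reversed(range(steps)):
        # Placeholder "denoiser": averaging with the four neighbours stands in
        # for the cross-region information exchange a real model performs.
        neighbours = (np.roll(x, 1, 0) + np.roll(x, -1, 0)
                      + np.roll(x, 1, 1) + np.roll(x, -1, 1))
        x = 0.5 * x + 0.125 * neighbours

        # Noise the condition to the current level and overwrite the masked
        # region, so the condition influences its surroundings but is never
        # altered by them (one-way flow).
        noise = rng.standard_normal(condition.shape)
        noised = np.sqrt(alpha_bar[t]) * condition + np.sqrt(1.0 - alpha_bar[t]) * noise
        x = np.where(mask, noised, x)

    # A final hard replacement gives exact agreement inside the mask.
    return np.where(mask, condition, x)
```

Calling `one_way_denoise_sketch(cond, mask)` with a 2-D array `cond` and a boolean `mask` of the same shape returns an array whose masked pixels equal `cond` exactly, while the unmasked pixels have been shaped by repeated mixing with the re-imposed region. The overwrite step is the design choice that matters here: it blocks any influence from the surroundings back onto the condition.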
