SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation

Yuhan Pei; Ruoyu Wang; Yongqi Yang; Ye Zhu; Olga Russakovsky; Yu Wu

arXiv:2411.19182 · cs.CV · May 8, 2026

TL;DR

This paper introduces SOW, a diffusion-based method that uses multimodal large language models (MLLMs) to improve contextual coherence and detail preservation in text-vision-to-image (TV2I) generation, addressing the interference between image regions that arises in diffusion models.

Contribution

The paper proposes the COW (Cyclic One-Way Diffusion) and SOW frameworks, which control information diffusion using MLLMs and improve semantic and spatial coherence in image generation without additional training.

Findings

01. SOW achieves pixel-level condition fidelity in image generation.

02. Controlled diffusion improves semantic and visual coherence.

03. Experiments demonstrate the effectiveness of the proposed methods.

Abstract

Originating from the diffusion phenomenon in physics, which describes the random movement and collisions of particles, diffusion generative models simulate a random walk in the data space along the denoising trajectory. This allows information to diffuse across regions, yielding harmonious outcomes. However, the chaotic and disordered nature of information diffusion in diffusion models often results in undesired interference between image regions, causing degraded detail preservation and contextual inconsistency. In this work, we address these challenges by reframing disordered diffusion as a powerful tool for text-vision-to-image generation (TV2I) tasks, achieving pixel-level condition fidelity while maintaining visual and semantic coherence throughout the image. We first introduce Cyclic One-Way Diffusion (COW), which provides an efficient unidirectional diffusion framework for…
