Contextualized Diffusion Models for Text-Guided Image and Video Generation
Ling Yang, Zhilong Zhang, Zhaochen Yu, Jingwei Liu, Minkai Xu, Stefano Ermon, Bin Cui

TL;DR
This paper introduces ContextDiff, a novel diffusion model that incorporates cross-modal context into both forward and reverse processes, significantly improving text-guided image and video generation quality and semantic alignment.
Contribution
The paper proposes a general contextualized diffusion framework that integrates text-visual interactions into every timestep of both the forward and reverse diffusion processes, improving the semantic consistency of generated visuals with the text condition.
Findings
Achieves state-of-the-art results in text-to-image generation.
Enhances semantic alignment between text and generated visuals.
Effective in text-to-video editing tasks.
Abstract
Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing. Nevertheless, prevailing text-guided visual diffusion models primarily focus on incorporating text-visual relationships exclusively into the reverse process, often disregarding their relevance in the forward process. This inconsistency between forward and reverse processes may limit the precise conveyance of textual semantics in visual synthesis results. To address this issue, we propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample into forward and reverse processes. We propagate this context to all timesteps in the two processes to adapt their trajectories, thereby facilitating cross-modal conditional modeling. We…
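To make the idea in the abstract concrete, the following is a minimal sketch of a forward diffusion step whose mean is shifted by a text-dependent context term. It is an illustration under stated assumptions, not the authors' implementation: the `context_shift` function is a hypothetical toy stand-in for a learned cross-modal adapter, the linear beta schedule is a common default, and the weight `k_t` is an arbitrary choice made here so that the shift vanishes at both ends of the trajectory.

```python
import math
import random


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t (linear schedule, assumed)."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / max(num_steps - 1, 1)
        prod *= 1.0 - beta
    return prod


def context_shift(x0, text_emb):
    """Hypothetical cross-modal context: scales the text embedding by its
    average agreement with the sample. A real model would learn this."""
    agreement = sum(a * b for a, b in zip(x0, text_emb)) / len(x0)
    return [agreement * c for c in text_emb]


def forward_mean(x0, text_emb, t, num_steps, shift_scale=1.0):
    """Mean of the noised sample at step t, including the context shift."""
    a = alpha_bar(t, num_steps)
    # Assumed weight: zero when a == 1 (t = 0) and when a -> 0 (pure noise).
    k_t = shift_scale * math.sqrt(a) * (1.0 - math.sqrt(a))
    shift = context_shift(x0, text_emb)
    return [math.sqrt(a) * x + k_t * s for x, s in zip(x0, shift)]


def forward_sample(x0, text_emb, t, num_steps, rng, shift_scale=1.0):
    """Draw x_t from a Gaussian around the context-shifted mean."""
    a = alpha_bar(t, num_steps)
    std = math.sqrt(1.0 - a)
    mean = forward_mean(x0, text_emb, t, num_steps, shift_scale)
    return [m + std * rng.gauss(0.0, 1.0) for m in mean]


if __name__ == "__main__":
    rng = random.Random(0)
    x0 = [1.0, -2.0]          # toy "visual sample"
    text_emb = [0.5, 0.5]     # toy "text embedding" of the same size
    print(forward_sample(x0, text_emb, t=500, num_steps=1000, rng=rng))
```

In this sketch a standard text-free forward process corresponds to `shift_scale=0.0`. A reverse process consistent with the abstract would have to account for the same shift when denoising, which this sketch does not cover.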
Taxonomy
Topics: Digital Humanities and Scholarship · Human Motion and Animation · 3D Modeling in Geospatial Applications
Methods: Diffusion · Focus
