Contextualized Diffusion Models for Text-Guided Image and Video Generation

Ling Yang; Zhilong Zhang; Zhaochen Yu; Jingwei Liu; Minkai Xu; Stefano Ermon; Bin Cui

arXiv:2402.16627·cs.CV·June 5, 2024·1 cites

TL;DR

This paper introduces ContextDiff, a novel diffusion model that incorporates cross-modal context into both forward and reverse processes, significantly improving text-guided image and video generation quality and semantic alignment.

Contribution

The paper proposes a general contextualized diffusion framework that integrates text-visual interactions into all diffusion process steps, enhancing semantic consistency in generated visuals.

Findings

01. Achieves state-of-the-art results in text-to-image generation.
02. Enhances semantic alignment between text and generated visuals.
03. Effective in text-to-video editing tasks.

Abstract

Conditional diffusion models have exhibited superior performance in high-fidelity text-guided visual generation and editing. Nevertheless, prevailing text-guided visual diffusion models primarily focus on incorporating text-visual relationships exclusively into the reverse process, often disregarding their relevance in the forward process. This inconsistency between forward and reverse processes may limit the precise conveyance of textual semantics in visual synthesis results. To address this issue, we propose a novel and general contextualized diffusion model (ContextDiff) by incorporating the cross-modal context encompassing interactions and alignments between text condition and visual sample into forward and reverse processes. We propagate this context to all timesteps in the two processes to adapt their trajectories, thereby facilitating cross-modal conditional modeling. We…
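
One way to read "incorporating the cross-modal context into the forward process" is as a text-dependent shift of the noising mean, applied at every timestep. The following is a minimal sketch under that assumption, not the implementation in the repository: `context_shift` is a hypothetical stand-in for a learned text-visual adapter, and the weight `k_t`, the `shift_scale` parameter, and the linear beta schedule are all illustrative choices.

```python
import math
import random


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s), s = 1..t, for a linear beta schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def context_shift(x0, text_emb):
    """Hypothetical cross-modal context term.

    A real model would learn this from text-visual interactions; here it
    simply points from the sample toward the text embedding.
    """
    return [c - x for x, c in zip(x0, text_emb)]


def forward_sample(x0, text_emb, t, num_steps, shift_scale=0.1, rng=random):
    """Draw x_t from a context-shifted Gaussian forward process (assumed form).

    mean = sqrt(alpha_bar_t) * x0 + k_t * context_shift(x0, text_emb)
    std  = sqrt(1 - alpha_bar_t)
    """
    a_bar = alpha_bar(t, num_steps)
    root = math.sqrt(a_bar)
    # Assumed weight: zero at t = 0, so the clean sample is left untouched.
    k_t = shift_scale * root * (1.0 - root)
    shift = context_shift(x0, text_emb)
    std = math.sqrt(1.0 - a_bar)
    return [root * x + k_t * s + std * rng.gauss(0.0, 1.0)
            for x, s in zip(x0, shift)]


if __name__ == "__main__":
    x0 = [1.0, -2.0]          # toy "image" with two pixels
    text_emb = [0.5, 0.5]     # toy text embedding of the same size
    print(forward_sample(x0, text_emb, t=500, num_steps=1000))
```

Setting `shift_scale=0.0` recovers the standard, context-free forward process, which makes the difference described in the abstract easy to isolate. Under the same reading, a matching reverse process would need to account for the same shift when denoising.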

Code & Models

Repositories

yangling0818/contextdiff · jax · Official

Taxonomy

Topics: Digital Humanities and Scholarship · Human Motion and Animation · 3D Modeling in Geospatial Applications

Methods: Diffusion · Focus