Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning

Penghui Ruan; Pichao Wang; Divya Saxena; Jiannong Cao; Yuhui Shi

arXiv:2410.24219·cs.CV·November 1, 2024

TL;DR

This paper introduces DEMO, a novel framework that decomposes text encoding and conditioning into content and motion components, significantly improving motion realism in text-to-video generation.

Contribution

The paper proposes a decomposed encoding and conditioning framework for T2V that models motion explicitly, addressing the tendency of earlier models to produce static or minimally dynamic videos.

Findings

01. DEMO outperforms existing models in motion dynamics on multiple benchmarks.

02. Enhanced motion understanding leads to more realistic and dynamic videos.

03. The approach maintains high visual quality while improving motion synthesis.

Abstract

Despite advancements in Text-to-Video (T2V) generation, producing videos with realistic motion remains challenging. Current models often yield static or minimally dynamic outputs, failing to capture complex motions described by text. This issue stems from the internal biases in text encoding, which overlooks motions, and inadequate conditioning mechanisms in T2V generation models. To address this, we propose a novel framework called DEcomposed MOtion (DEMO), which enhances motion synthesis in T2V generation by decomposing both text encoding and conditioning into content and motion components. Our method includes a content encoder for static elements and a motion encoder for temporal dynamics, alongside separate content and motion conditioning mechanisms. Crucially, we introduce text-motion and video-motion supervision to improve the model's understanding and generation of motion.…
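
To make the idea of separate content and motion conditioning concrete, here is a toy sketch in plain Python. It is not the paper's implementation. The function names (`cross_attend`, `decomposed_condition`), the use of unprojected single-head cross-attention, and the choice to apply content conditioning within each frame and motion conditioning along the frame axis are all assumptions made for illustration; the abstract only states that the two conditioning paths are separate.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, context):
    # Hypothetical single-head cross-attention with no learned projections:
    # each query is mapped to a softmax-weighted average of the context vectors.
    dim = len(context[0])
    scale = math.sqrt(dim)
    updates = []
    for q in queries:
        weights = softmax(
            [sum(a * b for a, b in zip(q, c)) / scale for c in context]
        )
        updates.append(
            [sum(w * c[i] for w, c in zip(weights, context)) for i in range(dim)]
        )
    return updates

def residual(vectors, updates):
    # Element-wise residual connection: x + update.
    return [[x + u for x, u in zip(xv, uv)] for xv, uv in zip(vectors, updates)]

def decomposed_condition(video, content_tokens, motion_tokens):
    # video is indexed as [frame][position][channel].
    # Assumed content path: the positions of each frame attend to content tokens.
    frames = [residual(f, cross_attend(f, content_tokens)) for f in video]
    # Assumed motion path: for each position, the sequence of features
    # across frames attends to motion tokens.
    n_frames, n_pos = len(frames), len(frames[0])
    out = [list(f) for f in frames]
    for p in range(n_pos):
        track = [frames[t][p] for t in range(n_frames)]
        track = residual(track, cross_attend(track, motion_tokens))
        for t in range(n_frames):
            out[t][p] = track[t]
    return out

# Toy input: 2 frames, 3 positions, 4 channels, all zeros.
video = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
content_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # stand-in content embeddings
motion_tokens = [[0.0, 0.0, 1.0, 0.0]]                          # stand-in motion embedding
out = decomposed_condition(video, content_tokens, motion_tokens)
print(out[0][0])
```

In this toy setup the content tokens only affect the first two channels and the motion token only affects the third, so the two conditioning signals remain distinguishable in the output. A real model would use learned projections, multiple heads, and encoders trained with the text-motion and video-motion supervision that the abstract mentions.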

Code & Models

Repositories

pr-ryan/demo (PyTorch, official)

Videos

Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning · SlidesLive

Taxonomy

Topics: Human Motion and Animation · Video Analysis and Summarization · Multimedia Communication and Technology