Mimir: Improving Video Diffusion Models for Precise Text Understanding

Shuai Tan; Biao Gong; Yutong Feng; Kecheng Zheng; Dandan Zheng; Shuwei Shi; Yujun Shen; Jingdong Chen; Ming Yang

arXiv:2412.03085·cs.CV·December 5, 2024

TL;DR

Mimir enhances text-to-video generation by integrating large language models with video diffusion models, improving text understanding and video quality, especially for short captions and dynamic motions.

Contribution

This work introduces Mimir, a novel end-to-end framework that harmonizes text encoders and LLMs for improved video generation from text descriptions.

Findings

01. High-quality video generation with better text comprehension.

02. Effective handling of short captions and shifting motions.

03. Demonstrated superiority over existing models through extensive experiments.

Abstract

Text serves as the key control signal in video generation due to its narrative nature. To render text descriptions into video clips, current video diffusion models borrow features from text encoders yet struggle with limited text comprehension. The recent success of large language models (LLMs) showcases the power of decoder-only transformers, which offer three clear benefits for text-to-video (T2V) generation: precise text understanding resulting from their superior scalability, imagination beyond the input text enabled by next-token prediction, and flexibility to prioritize user interests through instruction tuning. Nevertheless, the feature distribution gap emerging from the two different text modeling paradigms hinders the direct use of LLMs in established T2V models. This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully…

Code & Models

Models

🤗 Shuaishuai0219/Animate-X-plusplus (model)

Taxonomy

Topics: Video Analysis and Summarization · Music and Audio Processing · Image Retrieval and Classification Techniques

Methods: Diffusion