Mimir: Improving Video Diffusion Models for Precise Text Understanding
Shuai Tan, Biao Gong, Yutong Feng, Kecheng Zheng, Dandan Zheng, Shuwei Shi, Yujun Shen, Jingdong Chen, Ming Yang

TL;DR
Mimir enhances text-to-video generation by integrating large language models with video diffusion models, improving text understanding and video quality, especially for short captions and dynamic motions.
Contribution
This work introduces Mimir, a novel end-to-end framework that harmonizes text encoders and LLMs for improved video generation from text descriptions.
Findings
High-quality video generation with better text comprehension.
Effective handling of short captions and shifting motions.
Demonstrated superiority over existing models through extensive experiments.
Abstract
Text serves as the key control signal in video generation due to its narrative nature. To render text descriptions into video clips, current video diffusion models borrow features from text encoders yet struggle with limited text comprehension. The recent success of large language models (LLMs) showcases the power of decoder-only transformers, which offers three clear benefits for text-to-video (T2V) generation, namely, precise text understanding resulting from the superior scalability, imagination beyond the input text enabled by next token prediction, and flexibility to prioritize user interests through instruction tuning. Nevertheless, the feature distribution gap emerging from the two different text modeling paradigms hinders the direct use of LLMs in established T2V models. This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully…
Taxonomy
Topics: Video Analysis and Summarization · Music and Audio Processing · Image Retrieval and Classification Techniques
Methods: Diffusion
