MOVi: Training-free Text-conditioned Multi-Object Video Generation

Aimon Rahman; Jiang Liu; Ze Wang; Ximeng Sun; Jialian Wu; Xiaodong Yu; Yusheng Su; Vishal M. Patel; Zicheng Liu; Emad Barsoum

arXiv:2505.22980·cs.CV·May 30, 2025

TL;DR

This paper introduces a training-free method for multi-object text-conditioned video generation that improves object interaction accuracy and motion realism by leveraging large language models and attention manipulation.

Contribution

The paper presents a training-free approach that uses an LLM as the director of object trajectories and manipulates attention to improve multi-object video generation.

Findings

1. 42% improvement in motion dynamics and object accuracy
2. Enhanced object-specific feature capture and motion patterns
3. Maintained high fidelity and smoothness in generated videos

Abstract

Recent advances in diffusion-based text-to-video (T2V) models have demonstrated remarkable progress, but these models still face challenges in generating videos with multiple objects. Most models struggle to capture complex object interactions accurately, often treating some objects as static background elements and limiting their movement. In addition, they often fail to generate multiple distinct objects as specified in the prompt, resulting in incorrect generations or mixed features across objects. In this paper, we present a novel training-free approach for multi-object video generation that leverages the open-world knowledge of diffusion models and large language models (LLMs). We use an LLM as the "director" of object trajectories, and apply the trajectories through noise re-initialization to achieve precise control of realistic movements. We further refine the generation…
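To make the idea of applying a trajectory through noise re-initialization more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `reinitialize_noise`, the fixed-size box, the copy-the-first-frame-patch rule, and the hard-coded trajectory (standing in for one an LLM might propose) are all assumptions made for illustration. The sketch only shows one plausible way that a per-frame object position could be imprinted on the initial latent noise of a video diffusion model.

```python
import numpy as np


def reinitialize_noise(noise, trajectory, box_hw):
    """Copy one shared noise patch into every frame along a trajectory.

    Hypothetical helper, not taken from the paper.

    noise:      array of shape (T, C, H, W) holding i.i.d. Gaussian latent noise
    trajectory: list of T (top, left) integer box corners, one per frame;
                assumed to keep the box fully inside the frame
    box_hw:     (h, w) box size in latent pixels, assumed constant over time
    """
    out = noise.copy()
    h, w = box_hw
    top0, left0 = trajectory[0]
    # The patch under the object's box in the first frame acts as the
    # object's "own" noise and is carried along the path.
    patch = noise[0, :, top0:top0 + h, left0:left0 + w].copy()
    for t, (top, left) in enumerate(trajectory):
        out[t, :, top:top + h, left:left + w] = patch
    return out


rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 4, 32, 32))  # 4 frames, 4 latent channels
# Hypothetical left-to-right path; in the paper's setting an LLM proposes it.
trajectory = [(10, 2), (10, 8), (10, 14), (10, 20)]
moved = reinitialize_noise(noise, trajectory, box_hw=(8, 8))
```

Because the same patch appears at a shifted position in each frame, the initial noise is correlated along the intended path, while everything outside the boxes stays independent Gaussian noise. Whether the paper replaces, blends, or filters the noise in this region is not stated in the truncated abstract, so the plain replacement used here is only one possible choice.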

Taxonomy

Methods: Softmax · Attention Is All You Need · Diffusion