Mind the Time: Temporally-Controlled Multi-Event Video Generation
Ziyi Wu, Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov, Yuwei Fang, Varnith Chordia, Igor Gilitschenski, Sergey Tulyakov

TL;DR
MinT is a novel multi-event video generation model that enables precise temporal control over events within videos, producing coherent sequences by binding events to specific time periods and using a new time-aware encoding.
Contribution
We introduce MinT, the first model to provide temporal control over multiple events in video generation, utilizing a novel time-based positional encoding called ReRoPE.
Findings
MinT outperforms existing models in generating temporally controlled videos.
The ReRoPE encoding effectively guides cross-attention for temporal coherence.
Experiments show improved event ordering and timing accuracy.
Abstract
Real-world videos consist of sequences of events. Generating such sequences with precise temporal control is infeasible with existing video generators that rely on a single paragraph of text as input. When tasked with generating multiple events described using a single prompt, such methods often ignore some of the events or fail to arrange them in the correct order. To address this limitation, we present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. To enable time-aware interactions between event captions and video tokens, we design a time-based positional encoding method, dubbed ReRoPE. This encoding helps to guide the cross-attention operation. By fine-tuning a pre-trained video diffusion transformer on temporally grounded data, our…
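The idea of binding each event to a specific period of the video can be illustrated with a small sketch. This is not the authors' implementation and it is not ReRoPE: it only shows a hard frame-to-event assignment, whereas the abstract describes a positional encoding that guides cross-attention. The `Event` class, the helper names, the example captions, and the frame rate are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Event:
    """One event caption with its time span in seconds (hypothetical structure)."""
    caption: str
    start_s: float
    end_s: float


def frames_for_event(event: Event, fps: float, num_frames: int) -> List[int]:
    """Return indices of frames whose timestamp lies in [start_s, end_s)."""
    return [f for f in range(num_frames) if event.start_s <= f / fps < event.end_s]


def frame_to_event_map(events: List[Event], fps: float, num_frames: int) -> List[Optional[int]]:
    """For each frame, record the index of the event that covers it (None if uncovered)."""
    mapping: List[Optional[int]] = [None] * num_frames
    for idx, ev in enumerate(events):
        for f in frames_for_event(ev, fps, num_frames):
            mapping[f] = idx
    return mapping


# Illustrative input: two made-up events over a 2-second clip at 4 frames per second.
events = [
    Event("a man raises his arms", 0.0, 1.0),
    Event("he lowers his arms", 1.0, 2.0),
]
mapping = frame_to_event_map(events, fps=4, num_frames=8)
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

A map like this could be turned into a cross-attention mask so that video tokens from a frame only attend to the caption of the event covering that frame. That hard masking is an assumption of this sketch; a soft, encoding-based scheme such as the one the abstract describes would let tokens near event boundaries interact with neighboring captions as well.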
Taxonomy
Topics: Video Analysis and Summarization · Data Visualization and Analytics · Computer Graphics and Visualization Techniques
Methods: Diffusion · Focus
