
TL;DR
This paper introduces Event-Driven Video Generation (EVD), a framework that explicitly models when and where interactions are active in text-to-video synthesis. By concentrating updates on active events, it improves the realism and consistency of dynamic scenes.
Contribution
EVD is a minimal, DiT-compatible framework that adds event grounding to video generation. It targets common failure modes of existing models, such as motion that starts before contact and objects that drift after placement, by explicitly modeling interactions.
Findings
On EVD-Bench, EVD consistently improves human preference scores and VBench dynamics.
EVD reduces failure modes like object drift and broken support relations.
Explicit event grounding enhances interaction realism in generated videos.
Abstract
State-of-the-art text-to-video models often look realistic frame-by-frame yet fail on simple interactions: motion starts before contact, actions are not realized, objects drift after placement, and support relations break. We argue this stems from frame-first denoising, which updates latent state everywhere at every step without an explicit notion of when and where an interaction is active. We introduce Event-Driven Video Generation (EVD), a minimal DiT-compatible framework that makes sampling event-grounded: a lightweight event head predicts token-aligned event activity, event-grounded losses couple activity to state change during training, and event-gated sampling (with hysteresis and early-step scheduling) suppresses spurious updates while concentrating updates during interactions. On EVD-Bench, EVD consistently improves human preference and VBench dynamics, substantially reducing…
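The abstract's idea of event-gated sampling with hysteresis can be illustrated with a small sketch. This is not the authors' implementation: the function names (`hysteresis_gate`, `gated_update`), the two thresholds, and the use of flat Python lists in place of token-aligned tensors are all assumptions made for illustration. The sketch only shows the general pattern of a gate that switches on at a higher activity threshold, switches off at a lower one, and then masks the per-token update.

```python
from typing import List


def hysteresis_gate(
    activity: List[float],
    prev_gate: List[bool],
    on_thresh: float = 0.6,   # hypothetical value, not from the paper
    off_thresh: float = 0.4,  # hypothetical value, not from the paper
) -> List[bool]:
    """Per-token gate with hysteresis.

    A token that was inactive turns on only when its predicted event
    activity reaches on_thresh; a token that was active stays on until
    its activity falls below off_thresh. The gap between the two
    thresholds keeps the gate from flickering around a single cutoff.
    """
    gate = []
    for score, was_on in zip(activity, prev_gate):
        if was_on:
            gate.append(score >= off_thresh)
        else:
            gate.append(score >= on_thresh)
    return gate


def gated_update(
    latent: List[float],
    delta: List[float],
    gate: List[bool],
) -> List[float]:
    """Apply the denoising update only where the gate is open.

    Tokens with a closed gate keep their current value, which is the
    'suppress spurious updates' behaviour in its simplest form.
    """
    return [x + d if g else x for x, d, g in zip(latent, delta, gate)]


if __name__ == "__main__":
    # Four tokens: scores from a (hypothetical) event head, plus the
    # gate state carried over from the previous sampling step.
    activity = [0.7, 0.5, 0.5, 0.3]
    prev_gate = [False, False, True, True]
    gate = hysteresis_gate(activity, prev_gate)
    print(gate)
    print(gated_update([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], gate))
```

Note that tokens two and three share the same activity score of 0.5 but end up in different states because their previous gate states differ; that dependence on history is what distinguishes hysteresis from a single threshold. The early-step scheduling mentioned in the abstract is not modeled here, since the summary does not describe how it works.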
Taxonomy
Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
