TL;DR
This paper introduces Time Interval Encoding (TIE), a method for representing time intervals in video generation models that improves temporal control and accuracy when events overlap.
Contribution
TIE is a principled, plug-and-play interval-aware extension of rotary embeddings that improves temporal modeling in diffusion transformers for video generation.
Findings
TIE improves Temporal Constraint Satisfaction Rate from 77.34% to 96.03%.
TIE reduces temporal boundary error from 0.261s to 0.073s.
TIE enhances temporal alignment metrics in video generation.
Abstract
Director-style prompting, robotic action prediction, and interactive video agents demand temporal grounding over concurrent events. Overlapping events occur in 68% of general clips and over 99% of robotics/gameplay clips, yet existing multi-event generators rest on a single-active-prompt assumption. Moreover, modern video generators, such as Diffusion Transformers (DiT), represent time as discrete points through point-wise positional encodings. This formulation creates a fundamental dimension mismatch: temporally extended intervals and overlapping events are mathematically unrepresentable to the attention mechanism. In this paper, we propose Time Interval Encoding (TIE), a principled, plug-and-play interval-aware generalization of rotary embeddings that elevates time intervals to first-class primitives inside DiT cross-attention. Rather than introducing another…
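The mismatch between point-wise encodings and intervals can be made concrete with a small sketch. The abstract is cut off before it describes how TIE is built, so the code below is not the paper's formulation. It is an illustrative assumption: a standard rotary phasor for a single timestamp, and a hypothetical interval encoding obtained by averaging that phasor over the interval. All function names (`rope_freqs`, `point_phasor`, `interval_phasor`) are made up for this example.

```python
import numpy as np


def rope_freqs(dim, base=10000.0):
    """Standard rotary frequency schedule: one frequency per channel pair."""
    return base ** (-np.arange(0, dim, 2) / dim)


def point_phasor(t, freqs):
    """Point-wise rotary encoding of a single timestamp t, as e^{i*w*t}."""
    return np.exp(1j * freqs * t)


def interval_phasor(start, end, freqs):
    """Hypothetical interval encoding (an assumption, not the paper's TIE):
    the mean of e^{i*w*t} over [start, end].

    Closed form: e^{i*w*c} * sin(w*h) / (w*h), with midpoint c and
    half-length h. np.sinc(x) is sin(pi*x) / (pi*x), hence the 1/pi factor.
    """
    center = 0.5 * (start + end)
    half = 0.5 * (end - start)
    return np.exp(1j * freqs * center) * np.sinc(freqs * half / np.pi)


if __name__ == "__main__":
    freqs = rope_freqs(8)

    # Two events with the same midpoint but different durations.
    short_event = (1.0, 3.0)
    long_event = (0.0, 4.0)

    # A point-wise encoding only sees the midpoint, so both events collapse
    # onto the same code.
    mid_short = point_phasor(0.5 * sum(short_event), freqs)
    mid_long = point_phasor(0.5 * sum(long_event), freqs)
    print("point codes identical:", np.allclose(mid_short, mid_long))

    # The interval-averaged code keeps the phase of the midpoint but scales
    # each channel's magnitude by the duration, so the two events differ.
    enc_short = interval_phasor(*short_event, freqs)
    enc_long = interval_phasor(*long_event, freqs)
    print("interval codes identical:", np.allclose(enc_short, enc_long))
    print("magnitudes (short):", np.abs(enc_short))
    print("magnitudes (long): ", np.abs(enc_long))
```

In this sketch a zero-length interval reduces exactly to the point-wise phasor, which is one plausible reading of "generalization of rotary embeddings". Whether TIE uses this construction or a different one cannot be determined from the truncated abstract.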
