AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation
Sharath Girish, Viacheslav Ivanov, Tsai-Shien Chen, Hao Chen, Aliaksandr Siarohin, Sergey Tulyakov

TL;DR
AlcheMinT introduces a framework for fine-grained temporal control in multi-subject video generation, enabling precise control over when each subject appears and disappears while maintaining high visual quality.
Contribution
The paper presents a new positional encoding mechanism and a unified approach that allows explicit temporal control in subject-driven video synthesis without additional cross-attention modules.
Findings
Achieves state-of-the-art visual quality in personalized videos
Enables precise temporal control over multiple subjects
Establishes a benchmark for temporal adherence and identity preservation
Abstract
Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which are essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamps conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities, while seamlessly integrating with the pretrained video generation model positional embeddings. Additionally, we incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions,…
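To make the idea of encoding a temporal interval concrete, here is a minimal sketch. It is not the paper's mechanism, which the truncated abstract does not specify. It assumes a standard sinusoidal encoding and represents an interval by concatenating the encodings of its two endpoints. The names sinusoidal, interval_encoding, and is_active, and the subjects and frame ranges, are hypothetical and chosen only for illustration.

```python
import math


def sinusoidal(t, dim=8, base=10000.0):
    """Standard sinusoidal encoding of a scalar position t (dim must be even)."""
    out = []
    for i in range(dim // 2):
        freq = base ** (-2 * i / dim)
        out.extend([math.sin(t * freq), math.cos(t * freq)])
    return out


def interval_encoding(start, end, dim=8):
    """Hypothetical interval code: concatenate the encodings of both endpoints.

    This is an illustrative assumption, not the encoding used by AlcheMinT.
    """
    if end < start:
        raise ValueError("end must not precede start")
    return sinusoidal(start, dim) + sinusoidal(end, dim)


def is_active(frame, start, end):
    """True if the subject should be visible at this frame index (inclusive)."""
    return start <= frame <= end


# Two hypothetical subjects, each tied to a frame interval.
subjects = {"dog": (0, 15), "ball": (8, 31)}
codes = {name: interval_encoding(s, e) for name, (s, e) in subjects.items()}

# Which subjects are expected on screen at frame 10?
active_at_10 = sorted(n for n, (s, e) in subjects.items() if is_active(10, s, e))
```

In this sketch each subject carries a fixed-length vector describing its interval, which a model could attach to that subject's reference tokens. How the actual method integrates with the pretrained model's positional embeddings is described in the paper itself.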
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis
