AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation

Sharath Girish; Viacheslav Ivanov; Tsai-Shien Chen; Hao Chen; Aliaksandr Siarohin; Sergey Tulyakov

arXiv:2512.10943 · cs.CV · December 12, 2025

TL;DR

AlcheMinT introduces a novel framework for fine-grained temporal control in multi-subject video generation, enabling precise manipulation of subject appearance over time while maintaining high visual quality.

Contribution

The paper presents a new positional encoding mechanism and a unified approach that allows explicit temporal control in subject-driven video synthesis without additional cross-attention modules.

Findings

01 Achieves state-of-the-art visual quality in personalized videos
02 Enables precise temporal control over multiple subjects
03 Establishes a benchmark for temporal adherence and identity preservation

Abstract

Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which is essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamp conditioning for subject-driven video generation. Our approach uses a novel positional encoding mechanism that encodes temporal intervals, associated in our case with subject identities, while integrating seamlessly with the positional embeddings of the pretrained video generation model. Additionally, we incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions,…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis