Multi-subject Open-set Personalization in Video Generation

Tsai-Shien Chen; Aliaksandr Siarohin; Willi Menapace; Yuwei Fang; Kwot Sin Lee; Ivan Skorokhodov; Kfir Aberman; Jun-Yan Zhu; Ming-Hsuan Yang; Sergey Tulyakov

arXiv:2501.06187·cs.CV·March 21, 2025

TL;DR

Video Alchemist introduces a multi-subject, open-set personalization model for video generation that eliminates the need for time-consuming optimization, leveraging a new Diffusion Transformer and a comprehensive dataset and evaluation framework.

Contribution

The paper presents a novel multi-subject, open-set video personalization method with a Diffusion Transformer, along with a new dataset construction pipeline and benchmark for evaluation.

Findings

1. Outperforms existing methods in personalization quality
2. Supports multiple subjects and open-set scenarios
3. Eliminates test-time optimization

Abstract

Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist: a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a…
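
The fusion step described in the abstract, where video tokens attend to each reference image paired with its subject-level text prompt, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`bind_subject`, `personalize`), the additive binding of text and image embeddings, the single attention head, and the residual update are all assumptions chosen to keep the example small and dependency-free.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention on plain lists."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return out


def bind_subject(image_tokens, text_embedding):
    """Hypothetical binding: add the subject's text embedding to each image token.

    The paper's actual binding mechanism may differ; addition is an assumption.
    """
    return [[i + t for i, t in zip(tok, text_embedding)] for tok in image_tokens]


def personalize(video_tokens, subjects):
    """Let video tokens attend to all subjects' bound reference tokens.

    subjects: list of (image_tokens, text_embedding) pairs, one per subject.
    """
    cond = []
    for image_tokens, text_embedding in subjects:
        cond.extend(bind_subject(image_tokens, text_embedding))
    attended = cross_attention(video_tokens, cond, cond)
    # Residual update (an assumption): keep the video token and add the attended signal.
    return [[v + a for v, a in zip(vt, at)] for vt, at in zip(video_tokens, attended)]


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 4

    def rand_tokens(n):
        return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]

    video = rand_tokens(6)                      # 6 video tokens
    subjects = [
        (rand_tokens(3), rand_tokens(1)[0]),    # e.g. a person: 3 image tokens + text embedding
        (rand_tokens(3), rand_tokens(1)[0]),    # e.g. a background: 3 image tokens + text embedding
    ]
    updated = personalize(video, subjects)
    print(len(updated), len(updated[0]))
```

Because every subject contributes its own bound tokens to one shared conditioning sequence, adding another subject only lengthens that sequence; no per-subject optimization step is involved in this sketch.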

Taxonomy

Topics: Multimedia Communication and Technology · Video Analysis and Summarization · Recommender Systems and Techniques

Methods: Attention Is All You Need · Absolute Position Encodings · Adam · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer