OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

Yuanhao Cai; He Zhang; Xi Chen; Jinbo Xing; Yiwei Hu; Yuqian Zhou; Kai Zhang; Zhifei Zhang; Soo Ye Kim; Tianyu Wang; Yulun Zhang; Xiaokang Yang; Zhe Lin; Alan Yuille

arXiv:2506.23361·cs.CV·January 1, 2026

TL;DR

OmniVCus introduces a novel diffusion Transformer framework for multi-subject, subject-driven video customization using multimodal control signals, with a new data construction pipeline enabling instructive editing.

Contribution

The paper presents a new data construction pipeline and a diffusion Transformer framework with innovative embedding mechanisms for multi-subject video customization.

Findings

01. Outperforms state-of-the-art methods in quantitative evaluations.

02. Effectively incorporates multimodal control signals for video editing.

03. Enables multi-subject inference with enhanced embedding techniques.

Abstract

Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. How to use signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video is another challenging problem that remains underexplored. In this paper, we first propose a data construction pipeline, VideoCus-Factory, which produces training data pairs for multi-subject customization from unlabeled raw videos, together with control-signal pairs such as depth-to-video and mask-to-video. Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training scheme that uses image editing data to enable instructive editing of the subject in the customized video. We then propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE)…

Code & Models

Models

🤗 CaiYuanhao/OmniVCus · model · ♡ 4

Videos

OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions · SlidesLive

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Face recognition and analysis