TL;DR
This paper introduces MV-S2V, a multi-view subject-to-video generation task and an accompanying method that synthesizes videos from multiple reference views of a subject. It relies on synthetic training data and a new conditioning technique, TS-RoPE, to improve 3D subject consistency.
Contribution
The work proposes the multi-view S2V (MV-S2V) task, a synthetic data curation pipeline complemented by a small real-world captured dataset, and TS-RoPE, which helps the model distinguish cross-subject references from cross-view references.
Findings
Enforces 3D-level subject consistency by conditioning on multiple reference views rather than a single view.
Develops a synthetic data curation pipeline for training.
Introduces TS-RoPE to distinguish subjects and views effectively.
Abstract
Existing Subject-to-Video Generation (S2V) methods have achieved high-fidelity and subject-consistent video generation, yet remain constrained to single-view subject references. This limitation renders the S2V task reducible to an S2I + I2V pipeline, failing to exploit the full potential of video subject control. In this work, we propose and address the challenging Multi-View S2V (MV-S2V) task, which synthesizes videos from multiple reference views to enforce 3D-level subject consistency. To address the scarcity of training data, we first develop a synthetic data curation pipeline to generate highly customized synthetic data, complemented by a small-scale real-world captured dataset to boost the training of MV-S2V. Another key issue lies in the potential confusion between cross-subject and cross-view references in conditional generation. To overcome this, we further introduce Temporally…
