Phantom: Subject-consistent video generation via cross-modal alignment

Lijie Liu; Tianxiang Ma; Bingchuan Li; Zhuowei Chen; Jiawei Liu; Gen Li; Siyu Zhou; Qian He; Xinglong Wu

arXiv:2502.11079·cs.CV·April 11, 2025

TL;DR

Phantom is a novel framework that achieves high-fidelity, subject-consistent video generation by aligning text and image prompts with video content, improving over existing methods especially in multi-subject scenarios.

Contribution

We introduce Phantom, a unified cross-modal alignment framework that enhances subject consistency in video generation from text and images, addressing content leakage and multi-subject confusion.

Findings

01. Outperforms state-of-the-art commercial solutions.
02. Achieves high-fidelity, subject-consistent videos.
03. Effectively handles multi-subject references.

Abstract

The continuous development of foundational models for video generation is evolving into various applications, with subject-consistent video generation still in the exploratory stage. We refer to this as Subject-to-Video, which extracts subject elements from reference images and generates subject-consistent videos following textual instructions. We believe that the essence of subject-to-video lies in balancing the dual-modal prompts of text and image, thereby deeply and simultaneously aligning both text and visual content. To this end, we propose Phantom, a unified video generation framework for both single- and multi-subject references. Building on existing text-to-video and image-to-video architectures, we redesign the joint text-image injection model and drive it to learn cross-modal alignment via text-image-video triplet data. The proposed method achieves high-fidelity…

Code & Models

Repositories

phantom-video/phantom (PyTorch)

Taxonomy

Topics: Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization