EchoShot: Multi-Shot Portrait Video Generation
Jiahao Wang, Hualian Sheng, Sijia Cai, Weizhan Zhang, Caixia Yan, Yachuang Feng, Bing Deng, Jieping Ye

TL;DR
EchoShot is a scalable multi-shot portrait video generation framework that ensures identity consistency and attribute control, built on a novel video diffusion transformer and a new large-scale dataset.
Contribution
It introduces shot-aware position embeddings for multi-shot modeling and constructs PortraitGala, a high-fidelity dataset for training and evaluation.
Findings
Achieves superior identity consistency in multi-shot videos
Enables attribute-level controllability in portrait video synthesis
Supports reference image-based personalized and long video generation
Abstract
Video diffusion models substantially boost the productivity of artistic workflows with high-quality portrait video generative capacity. However, prevailing pipelines are primarily constrained to single-shot creation, while real-world applications urge for multiple shots with identity consistency and flexible content controllability. In this work, we propose EchoShot, a native and scalable multi-shot framework for portrait customization built upon a foundation video diffusion model. To start with, we propose shot-aware position embedding mechanisms within video diffusion transformer architecture to model inter-shot variations and establish intricate correspondence between multi-shot visual content and their textual descriptions. This simple yet effective design enables direct training on multi-shot video data without introducing additional computational overhead. To facilitate model…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenerative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Multimodal Machine Learning Applications
