EchoShot: Multi-Shot Portrait Video Generation

Jiahao Wang; Hualian Sheng; Sijia Cai; Weizhan Zhang; Caixia Yan; Yachuang Feng; Bing Deng; Jieping Ye

arXiv:2506.15838·cs.CV·June 23, 2025

TL;DR

EchoShot is a scalable multi-shot portrait video generation framework that ensures identity consistency and attribute control, built on a novel video diffusion transformer and a new large-scale dataset.

Contribution

It introduces shot-aware position embeddings for multi-shot modeling and constructs PortraitGala, a high-fidelity dataset for training and evaluation.
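
One way to picture a shot-aware position signal is sketched below. This is an illustrative assumption, not the paper's actual embedding: frames inside a shot receive consecutive temporal indices, and a fixed jump is added at every cut so that a position-based embedding can tell shots apart. The function names and the `shot_gap` parameter are hypothetical.

```python
from typing import List


def shot_aware_positions(shot_lengths: List[int], shot_gap: int = 4) -> List[int]:
    """Assign a temporal position index to every frame of a multi-shot clip.

    Frames within a shot get consecutive indices. An extra jump of `shot_gap`
    is added at each shot boundary so the position signal marks the cut.
    `shot_gap` is a hypothetical parameter, not a value taken from the paper.
    """
    if any(n <= 0 for n in shot_lengths):
        raise ValueError("every shot must contain at least one frame")
    positions: List[int] = []
    next_pos = 0
    for n in shot_lengths:
        positions.extend(range(next_pos, next_pos + n))
        next_pos += n + shot_gap  # skip ahead to mark the shot boundary
    return positions


def shot_ids(shot_lengths: List[int]) -> List[int]:
    """Return the shot index of each frame, e.g. to pair frames with per-shot captions."""
    return [i for i, n in enumerate(shot_lengths) for _ in range(n)]


if __name__ == "__main__":
    lengths = [3, 2]  # two shots: 3 frames, then 2 frames
    print(shot_aware_positions(lengths))  # [0, 1, 2, 7, 8]
    print(shot_ids(lengths))              # [0, 0, 0, 1, 1]
```

Because the indices only relabel existing frames, a scheme of this kind adds no extra tokens or parameters, which is in line with the stated goal of training on multi-shot data without additional computational overhead.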

Findings

01 Achieves superior identity consistency in multi-shot videos

02 Enables attribute-level controllability in portrait video synthesis

03 Supports reference image-based personalized and long video generation

Abstract

Video diffusion models substantially boost the productivity of artistic workflows through their capacity to generate high-quality portrait videos. However, prevailing pipelines are primarily constrained to single-shot creation, while real-world applications call for multiple shots with identity consistency and flexible content controllability. In this work, we propose EchoShot, a native and scalable multi-shot framework for portrait customization built upon a foundation video diffusion model. To start with, we propose shot-aware position embedding mechanisms within the video diffusion transformer architecture to model inter-shot variations and establish intricate correspondence between multi-shot visual content and their textual descriptions. This simple yet effective design enables direct training on multi-shot video data without introducing additional computational overhead. To facilitate model…

Code & Models

Models

🤗 JonneyWang/EchoShot (model, ♡ 22)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Multimodal Machine Learning Applications