EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion

Jiangchuan Wei, Shiyue Yan, Wenfeng Lin, Boyuan Liu, Renjie Chen, and Mingyu Guo

arXiv:2501.13452·cs.CV·February 28, 2025

TL;DR

EchoVideo introduces a novel multimodal feature fusion approach with a two-stage training strategy to improve identity preservation and reduce artifacts in human video generation.

Contribution

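The paper proposes EchoVideo, a framework that integrates high-level semantic features from text through an Identity Image-Text Fusion Module (IITF) and adds a stochastic second training stage that randomly uses shallow facial information, with the aim of improving identity fidelity in video synthesis.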

Findings

01. Effective identity preservation in generated videos

02. Reduced artifacts and improved visual quality

03. High controllability and fidelity in outputs

Abstract

Recent advancements in video generation have significantly impacted various downstream applications, particularly in identity-preserving video generation (IPT2V). However, existing methods struggle with "copy-paste" artifacts and low similarity issues, primarily due to their reliance on low-level facial image information. This dependence can result in rigid facial appearances and artifacts reflecting irrelevant details. To address these challenges, we propose EchoVideo, which employs two key strategies: (1) an Identity Image-Text Fusion Module (IITF) that integrates high-level semantic features from text, capturing clean facial identity representations while discarding occlusions, poses, and lighting variations to avoid the introduction of artifacts; (2) a two-stage training strategy, incorporating a stochastic method in the second phase to randomly utilize shallow facial information.…
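The stochastic second-phase idea mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `select_conditions`, the dictionary keys, and the probability `use_shallow_prob` are hypothetical, and the abstract does not say how the randomness is actually implemented in EchoVideo. The sketch only shows the general pattern of randomly including or omitting a shallow facial input on each training step while always keeping the high-level identity features.

```python
import random


def select_conditions(identity_features, shallow_face_features,
                      use_shallow_prob=0.5, rng=random):
    """Pick the conditioning inputs for one (hypothetical) training step.

    identity_features: high-level identity representation (always used).
    shallow_face_features: low-level facial information (used at random).
    use_shallow_prob: assumed probability of keeping the shallow input;
        the value 0.5 is a placeholder, not a figure from the paper.
    rng: any object with a random() method returning a float in [0, 1).
    """
    if not 0.0 <= use_shallow_prob <= 1.0:
        raise ValueError("use_shallow_prob must be in [0, 1]")

    # Draw once per step; keep the shallow features only if the draw succeeds.
    keep_shallow = rng.random() < use_shallow_prob
    return {
        "identity": identity_features,
        "shallow_face": shallow_face_features if keep_shallow else None,
    }
```

A training loop built on this sketch would call `select_conditions` once per step and pass a seeded `random.Random` instance as `rng` when reproducibility matters. Because the shallow input is sometimes absent, a model trained this way cannot depend on it alone, which matches the abstract's stated motivation of avoiding reliance on low-level facial image information.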

Code & Models

Repositories

bytedance/echovideo (PyTorch)

Models

bytedance-research/EchoVideo (Hugging Face model, 3 downloads, 7 likes)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Human Pose and Action Recognition