MoCha: End-to-End Video Character Replacement without Structural Guidance

Zhengbo Xu; Jie Ma; Ziheng Wang; Zhan Peng; Jun Liang; Jing Li

arXiv:2601.08587 · cs.CV · January 15, 2026

TL;DR

MoCha is an end-to-end framework for video character replacement that needs no structural guidance. It takes only a single arbitrary frame mask as input and relies on a new data construction pipeline to outperform existing methods in complex scenarios.

Contribution

The paper presents MoCha, an approach that replaces video characters using only a single arbitrary frame mask, without per-frame segmentation masks or explicit structural guidance (e.g., skeleton, depth), supported by a new data construction pipeline.

Findings

01. Outperforms state-of-the-art methods in complex scenarios

02. Requires only a single arbitrary frame mask for replacement

03. Uses a new data pipeline with UE5 and synthesized datasets

Abstract

Controllable video character replacement with a user-provided identity remains a challenging problem due to the lack of paired video data. Prior works have predominantly relied on a reconstruction-based paradigm that requires per-frame segmentation masks and explicit structural guidance (e.g., skeleton, depth). This reliance, however, severely limits their generalizability in complex scenarios involving occlusions, character-object interactions, unusual poses, or challenging illumination, often leading to visual artifacts and temporal inconsistencies. In this paper, we propose MoCha, a pioneering framework that bypasses these limitations by requiring only a single arbitrary frame mask. To effectively adapt the multi-modal input condition and enhance facial identity, we introduce a condition-aware RoPE and employ an RL-based post-training stage. Furthermore, to overcome the scarcity of…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Human Motion and Animation