V-Warper: Appearance-Consistent Video Diffusion Personalization via Value Warping
Hyunkoo Lee, Wooseok Jang, Jini Yang, Taehwan Kim, Sangoh Kim, Sangwon Jung, Seungryong Kim

TL;DR
V-Warper is a training-free framework that enhances appearance consistency in video diffusion personalization by combining coarse image-based adaptation with inference-time appearance refinement, avoiding heavy video finetuning.
Contribution
The paper introduces a training-free, coarse-to-fine personalization method for video diffusion models that maintains appearance fidelity without large-scale video datasets.
Findings
Significantly improves appearance fidelity in personalized videos.
Maintains prompt alignment and motion dynamics effectively.
Operates efficiently without additional video training.
Abstract
Video personalization aims to generate videos that faithfully reflect a user-provided subject while following a text prompt. However, existing approaches often rely on heavy video-based finetuning or large-scale video datasets, which impose substantial computational cost and are difficult to scale. Furthermore, they still struggle to maintain fine-grained appearance consistency across frames. To address these limitations, we introduce V-Warper, a training-free coarse-to-fine personalization framework for transformer-based video diffusion models. The framework enhances fine-grained identity fidelity without requiring any additional video training. (1) A lightweight coarse appearance adaptation stage leverages only a small set of reference images, which are already required for the task. This step encodes global subject identity through image-only LoRA and subject-embedding adaptation.…
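The abstract names LoRA as the mechanism for the coarse adaptation stage but does not spell it out here. As a rough illustration only, the sketch below shows the general LoRA idea: a frozen weight matrix W is augmented with a low-rank update (alpha / r) * B A, and only A and B would be trained. This is not the authors' implementation. The class name `LoRALinear`, the initialization values, and the toy dimensions are all assumptions made for the example, and the training loop on reference images is omitted.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class LoRALinear:
    """Hypothetical toy layer: frozen weight W plus low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weight; would not receive gradient updates
        # A starts with small random values and B starts at zero,
        # so the adapted layer initially reproduces the base layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)               # W x
        delta = matvec(self.B, matvec(self.A, x))   # B (A x), rank-r correction
        return [b + self.scale * d for b, d in zip(base, delta)]


# Toy usage: a 2x3 base weight with a rank-1 adapter.
W = [[1.0, 2.0, 3.0], [0.0, -1.0, 0.5]]
layer = LoRALinear(W, rank=1, alpha=2.0)
x = [1.0, 1.0, 1.0]
y0 = layer.forward(x)  # equals W x while B is still zero
```

Because B is zero at initialization, the adapter changes nothing until it is trained; the number of trainable values is r * (d_in + d_out) rather than d_in * d_out, which is why such adaptation can be done from a small set of reference images.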
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Multimodal Machine Learning Applications
