TL;DR
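TAVR (Talking Avatar generation from Video Reference) is a framework that generates high-fidelity talking avatars from cross-scene video references rather than a single same-scene image. It pairs a token selection module with a three-stage training scheme and is evaluated on a new cross-scene robustness benchmark.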
Contribution
The paper introduces a cross-scene video reference approach that combines a token selection module, which handles the longer temporal context of video references, with a three-stage training scheme for robust avatar synthesis.
Findings
TAVR outperforms existing methods quantitatively and qualitatively.
The framework demonstrates strong cross-scene robustness.
A new benchmark with 158 cross-scene video pairs was created.
Abstract
Existing talking avatar methods typically adopt an image-to-video pipeline conditioned on a static reference image within the same scene as the target generation. This restricted, single-view perspective lacks sufficient temporal and expression cues, limiting the ability to synthesize high-fidelity talking avatars in customized backgrounds. To this end, we introduce Talking Avatar generation from Video Reference (TAVR), a novel framework that shifts the paradigm by leveraging cross-scene video inputs. To effectively process these extended temporal contexts and bridge cross-scene domain gaps, TAVR integrates a token selection module alongside a comprehensive three-stage training scheme. Specifically, same-scene video pretraining establishes foundational appearance copying, which is subsequently expanded by cross-scene reference fine-tuning for robust cross-scene adaptation. Finally,…
