STARCaster: Spatio-Temporal AutoRegressive Video Diffusion for Identity- and View-Aware Talking Portraits

Foivos Paraperas Papantoniou; Stathis Galanakis; Rolandos Alexandros Potamias; Bernhard Kainz; Stefanos Zafeiriou

arXiv:2512.13247 · cs.CV · December 16, 2025

TL;DR

STARCaster introduces a unified spatio-temporal diffusion model for identity- and view-aware talking portraits, improving motion diversity and view synthesis without relying heavily on reference guidance or perfect 3D reconstructions.

Contribution

The paper proposes a novel diffusion-based framework that implicitly encodes 3D awareness and uses decoupled training for view and temporal consistency, advancing talking portrait synthesis.

Findings

01. Outperforms prior methods in benchmarks
02. Generalizes across identities and tasks
03. Produces more dynamic and view-consistent animations

Abstract

This paper presents STARCaster, an identity-aware spatio-temporal video diffusion model that addresses both speech-driven portrait animation and free-viewpoint talking portrait synthesis, given an identity embedding or reference image, within a unified framework. Existing 2D speech-to-video diffusion models depend heavily on reference guidance, leading to limited motion diversity. At the same time, 3D-aware animation typically relies on inversion through pre-trained tri-plane generators, which often leads to imperfect reconstructions and identity drift. We rethink reference- and geometry-based paradigms in two ways. First, we deviate from strict reference conditioning at pre-training by introducing softer identity constraints. Second, we address 3D awareness implicitly within the 2D video domain by leveraging the inherent multi-view nature of video data. STARCaster adopts a…

Taxonomy

Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis