EARTalking: End-to-end GPT-style Autoregressive Talking Head Synthesis with Frame-wise Control
Yuzhe Weng, Haotian Wang, Yuanhong Yu, Jun Du, Shan He, Xiaoyan Wu, Haoran Xu

TL;DR
EARTalking introduces an end-to-end autoregressive model for realistic, controllable talking head video generation from audio, overcoming limitations of previous methods with a novel streaming, in-context control mechanism.
Contribution
The paper presents EARTalking, a novel GPT-style autoregressive model with frame-wise control and Sink Frame Window Attention for high-quality, flexible talking head synthesis.
Findings
Outperforms existing autoregressive methods in quality.
Achieves performance comparable to diffusion-based methods.
Enables interactive, arbitrary control at every frame.
Abstract
Audio-driven talking head generation aims to create vivid and realistic videos from a static portrait and speech. Existing AR-based methods rely on intermediate facial representations, which limit their expressiveness and realism. Meanwhile, diffusion-based methods generate clip-by-clip, lacking fine-grained control and causing inherent latency due to overall denoising across the window. To address these limitations, we propose EARTalking, a novel end-to-end, GPT-style autoregressive model for interactive audio-driven talking head generation. Our method introduces a novel frame-by-frame, in-context, audio-driven streaming generation paradigm. For inherently supporting variable-length video generation with identity consistency, we propose the Sink Frame Window Attention (SFA) mechanism. Furthermore, to avoid the complex, separate networks that prior works required for diverse control…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenerative Adversarial Networks and Image Synthesis · Face recognition and analysis · Music Technology and Sound Studies
