EARTalking: End-to-end GPT-style Autoregressive Talking Head Synthesis with Frame-wise Control

Yuzhe Weng; Haotian Wang; Yuanhong Yu; Jun Du; Shan He; Xiaoyan Wu; Haoran Xu

arXiv:2603.20307 · cs.CV · March 24, 2026

Open Access

TL;DR

EARTalking introduces an end-to-end autoregressive model for realistic, controllable talking head video generation from audio, overcoming limitations of previous methods with a novel streaming, in-context control mechanism.

Contribution

The paper presents EARTalking, a novel GPT-style autoregressive model with frame-wise control and Sink Frame Window Attention for high-quality, flexible talking head synthesis.

Findings

01. Outperforms existing autoregressive methods in quality.

02. Achieves performance comparable to diffusion-based methods.

03. Enables interactive, arbitrary control at every frame.

Abstract

Audio-driven talking head generation aims to create vivid and realistic videos from a static portrait and speech. Existing autoregressive (AR) methods rely on intermediate facial representations, which limit their expressiveness and realism. Meanwhile, diffusion-based methods generate video clip by clip, which precludes fine-grained control and incurs inherent latency because the whole window is denoised jointly. To address these limitations, we propose EARTalking, a novel end-to-end, GPT-style autoregressive model for interactive audio-driven talking head generation. Our method introduces a frame-by-frame, in-context, audio-driven streaming generation paradigm. To natively support variable-length video generation with identity consistency, we propose the Sink Frame Window Attention (SFA) mechanism. Furthermore, to avoid the complex, separate networks that prior works required for diverse control…
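The abstract names Sink Frame Window Attention but the excerpt here does not define it. As a rough illustration only, the sketch below assumes one plausible reading of the name: each frame attends causally to a fixed set of leading "sink" frames (for example the reference portrait) plus a sliding window of the most recent frames. The function name `sink_window_mask` and the parameters `window` and `num_sink` are hypothetical and are not taken from the paper.

```python
def sink_window_mask(num_frames, window, num_sink=1):
    """Frame-level attention mask: mask[q][k] is True if frame q may attend to frame k.

    ASSUMPTION (not from the paper): frame q sees the first `num_sink` frames
    plus the `window` most recent frames up to and including itself.
    """
    if num_frames < 0 or window < 1 or num_sink < 0:
        raise ValueError("num_frames >= 0, window >= 1 and num_sink >= 0 are required")
    mask = []
    for q in range(num_frames):
        row = []
        for k in range(num_frames):
            causal = k <= q            # never attend to future frames
            is_sink = k < num_sink     # always-visible leading frames
            in_window = q - k < window # recent frames, including q itself
            row.append(causal and (is_sink or in_window))
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print the mask for 6 frames, a window of 2, and 1 sink frame.
    for row in sink_window_mask(6, 2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this assumed reading, each frame attends to at most `num_sink + window` frames regardless of video length, which is one way a windowed scheme could keep per-frame cost bounded while the sink frames stay visible as an identity anchor.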

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Music Technology and Sound Studies