KlingAvatar 2.0 Technical Report

Kling Team: Jialu Chen; Yikang Ding; Zhixue Fang; Kun Gai; Yuan Gao; Kang He; Jingyun Hua; Boyuan Jiang; Mingming Lao; Xiaohan Li; Hui Liu; Jiwen Liu; Xiaoqiang Liu; Yuan Liu; Shun Lu; Yongsen Mao; Yingchao Shao; Huafeng Shi; Xiaoyu Shi; Peiqin Sun; Songlin Tang; Pengfei Wan; Chao Wang; Xuebo Wang; Haoxian Zhang; Yuanxing Zhang; Yan Zhou

arXiv:2512.13313 · cs.CV · December 16, 2025

TL;DR

KlingAvatar 2.0 introduces a spatio-temporal cascade framework with multi-modal instruction fusion for efficient, high-resolution, long-duration avatar video generation with improved quality and coherence.

Contribution

The paper presents KlingAvatar 2.0, a novel framework combining spatio-temporal upscaling and multi-modal instruction alignment for long-form avatar videos.

Findings

01 Effective long-duration high-resolution video generation

02 Enhanced lip synchronization and identity preservation

03 Improved multimodal instruction following

Abstract

Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both spatial resolution and temporal dimension. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, and then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large…
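As a rough illustration of the first-last frame strategy described in the abstract, the sketch below splits a long timeline into sub-clips bounded by consecutive blueprint keyframes, so that adjacent sub-clips share a boundary frame. It is only a scheduling sketch under stated assumptions: the function name `plan_subclips`, the `keyframe_stride` parameter, and the uniform keyframe spacing are hypothetical and are not taken from the paper, and no actual generation or upscaling is performed.

```python
from typing import List, Tuple


def plan_subclips(num_frames: int, keyframe_stride: int) -> List[Tuple[int, int]]:
    """Split a long video into sub-clips bounded by consecutive keyframes.

    Hypothetical helper (not from the paper): each returned (first, last)
    pair names the two blueprint keyframes that would condition one
    sub-clip. Adjacent sub-clips share a boundary keyframe.
    """
    if num_frames < 2 or keyframe_stride < 1:
        raise ValueError("need at least 2 frames and a positive stride")
    # Assumed uniform keyframe positions on the full-length timeline.
    keyframes = list(range(0, num_frames, keyframe_stride))
    # Always end on the final frame so the last sub-clip is closed.
    if keyframes[-1] != num_frames - 1:
        keyframes.append(num_frames - 1)
    # Pair each keyframe with its successor: (first frame, last frame).
    return list(zip(keyframes[:-1], keyframes[1:]))


clips = plan_subclips(num_frames=10, keyframe_stride=4)
print(clips)  # [(0, 4), (4, 8), (8, 9)]
```

Because each sub-clip starts on the frame where the previous one ends, a refinement stage conditioned on both endpoints has a fixed anchor at every boundary, which is one plausible way to read the abstract's claim of smooth temporal transitions in long-form videos.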

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Speech and Audio Processing