InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing

Shaoshu Yang, Zhe Kong, Feng Gao, Meng Cheng, Xiangyu Liu, Yong Zhang, Zhuoliang Kang, Wenhan Luo, Xunliang Cai, Ran He, and Xiaoming Wei

arXiv:2508.14033·cs.CV·August 20, 2025

5 Reviews

TL;DR

InfiniteTalk introduces a novel audio-driven video generation method that enables seamless, full-body, long-sequence video dubbing by preserving key reference frames and leveraging temporal context for improved realism and synchronization.

Contribution

The paper presents InfiniteTalk, a streaming generator for infinite-length video dubbing that overcomes naive model limitations by using temporal context and reference frames for holistic, synchronized full-body animation.

Findings

01

Achieves state-of-the-art realism and synchronization on multiple datasets.

02

Effectively maintains identity and gestures during long video sequences.

03

Outperforms existing methods in visual coherence and emotional expression.

Abstract

Recent breakthroughs in video AIGC have ushered in a transformative era for audio-driven human animation. However, conventional video dubbing techniques remain constrained to mouth region editing, resulting in discordant facial expressions and body gestures that compromise viewer immersion. To overcome this limitation, we introduce sparse-frame video dubbing, a novel paradigm that strategically preserves reference keyframes to maintain identity, iconic gestures, and camera trajectories while enabling holistic, audio-synchronized full-body motion editing. Through critical analysis, we identify why naive image-to-video models fail in this task, particularly their inability to achieve adaptive conditioning. Addressing this, we propose InfiniteTalk, a streaming audio-driven generator designed for infinite-length long sequence dubbing. This architecture leverages temporal context frames for…
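
The streaming idea in the abstract (generating a long sequence chunk by chunk while carrying temporal context frames forward) can be illustrated with a small scheduling sketch. This is not the authors' implementation: `plan_chunks`, `Chunk`, `chunk_len`, and `context_len` are hypothetical names chosen for illustration, and the sketch only plans frame indices; it assumes each chunk after the first is conditioned on the last few frames of the previous chunk.

```python
from typing import List, NamedTuple


class Chunk(NamedTuple):
    context: range  # frames reused from the previous chunk as conditioning
    new: range      # frames newly generated in this chunk


def plan_chunks(total_frames: int, chunk_len: int, context_len: int) -> List[Chunk]:
    """Split a long sequence into chunks that each carry over context frames.

    Hypothetical helper: the model window holds `chunk_len` frames, of which
    up to `context_len` are taken from the tail of the previous chunk.
    """
    if not 0 <= context_len < chunk_len:
        raise ValueError("context_len must be in [0, chunk_len)")
    chunks: List[Chunk] = []
    start = 0
    while start < total_frames:
        # The first chunk has no history; later chunks reuse the previous tail.
        ctx_start = max(0, start - context_len)
        new_len = chunk_len if start == 0 else chunk_len - context_len
        end = min(total_frames, start + new_len)
        chunks.append(Chunk(range(ctx_start, start), range(start, end)))
        start = end
    return chunks


if __name__ == "__main__":
    for chunk in plan_chunks(total_frames=10, chunk_len=4, context_len=1):
        print(list(chunk.context), list(chunk.new))
```

Because every new chunk only needs the tail of the previous one, the schedule can be extended for as long as audio is available, which is the sense in which such a generator is "streaming".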

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 2·Confidence 4

Strengths

* This paper goes beyond the traditional paradigm of mouth region editing, aiming to synchronize facial, head, and body movements with audio. The integration of key reference frames to preserve identity, gestures, and camera trajectory while enabling dynamic full-body motion is a promising advancement.
* The method is evaluated on multiple datasets (HDTF, CelebV-HQ, EMTD), using both quantitative and qualitative metrics. The model's ability to perform audio-aligned motion with high visual quali

Weaknesses

* Figure 5 is confusing and difficult to understand. The caption of Figure 5 refers to "top 4 rows," but only 3 rows are visible in Figure 5 (left). Additionally, Figure 5 (right) lacks sufficient details for readers to understand how M0-M4 are implemented, and the meaning of the rectangles with different colors remains unclear. The figure should be corrected to match the caption, and a more detailed illustration about Sec.3.4 should be provided, clarifying the color-coding and the implementatio

Reviewer 02·Rating 4·Confidence 4

Strengths

1. The quality of the objective and subjective results is good.
2. The models are open-source, which is useful for the community.
3. The paper is well organized and easy to follow.
4. The video results in the supplementary material are sufficient.

Weaknesses

1. Better Video Dubbing or Audio-Conditioned Text-Image to Video (ATI2V) model? The motivation of this paper is to propose a better video dubbing method that can edit more than the lip region. But from my understanding, video dubbing and ATI2V are different tasks: you may only want to edit the lip region and keep the body gestures of the original actor, for example in video dubbing or post-editing for films. I think it is unfair to compare with video dubbing models. Besides, in general any ATI2V models have this a

Reviewer 03·Rating 6·Confidence 4

Strengths

1. Novel Problem Formulation: the notion of “sparse-frame video dubbing” is interesting and it generalizes mouth-region editing to holistic, full-body motion editing in dubbing tasks.
2. Additional modules (SDEdit, Uni3C, UniAnimate) show good extensibility for camera and pose control.

Weaknesses

1. Unclear Methodological Details: Equation (3) lacks explicit tensor dimension consistency; although Fig. 5 looks fancy, it does not help readers understand the different types of reference position strategy.
2. Not enough experiments: how the number of context frames affects continuity; a comparison of effectiveness between reference cross-attention and channel-wise concatenation.
3. Overstatement of Generalization and “Infinite-Length” Claims: “Infinite-length” generation is actu

Reviewer 04·Rating 4·Confidence 4

Strengths

1. InfiniteTalk proposes the "sparse-frame video dubbing" paradigm. It breaks traditional "mouth-only edit" limitations, enabling full-body audio-aligned motion while preserving the source video’s identity, scene, and camera trajectory.
2. The paper adapts context frame and reference frame strategies to dubbing needs. Context frames focus on dynamic motion, bind to audio via cross-attention, and solve long-sequence abruptness and motion-audio mismatch.
3. Its experiments are rigorous, using 3 da

Weaknesses

1. While applied to video dubbing, the core reference-based long-sequence generation strategy is already common in I2V, limiting the work’s originality.
2. Supplementary results omit camera control examples, and the analysis of limitations (e.g., motion repetition, efficiency) is insufficient.
3. It would be good to visualize the ablation study for the soft reference conditioning.

Reviewer 05·Rating 4·Confidence 4

Strengths

1. The exploration of the reference strategy is relatively detailed, which is inspiring for other works in the field.
2. The chunk transition in the demo video is relatively natural, and traces of chunk switching are hardly noticeable. Moreover, with a duration of approximately one minute, the video does not show obvious error accumulation.
3. The quantitative experiments have achieved results comparable to SOTAs.
4. The writing and presentation are clear and easy to understand.

Weaknesses

1. The proposed task seems to be confusing. The proposed sparse video dubbing task requires the original video to obtain sparse key frames, and the generated length depends on the length of the original video. However, the contribution claimed in the paper is infinite-length video generation, which conflicts with the task setting since there are no “infinite” images in a certain video. Additionally, the method in the paper does not include content related to key frame planning and generation.
2.
