AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement

Zhizhou Zhong; Yicheng Ji; Zhe Kong; Yiying Liu; Jiarui Wang; Jiasun Feng; Lupeng Liu; Xiangyi Wang; Yanjia Li; Yuqing She; Ying Qin; Huan Li; Shuiyang Mao; Wei Liu; Wenhan Luo

arXiv:2511.23475·cs.CV·December 1, 2025

TL;DR

AnyTalker is a scalable multi-person video generation framework that uses a novel attention mechanism and training pipeline to produce natural, interactive talking videos with minimal multi-person data.

Contribution

It introduces an identity-aware attention mechanism and a training pipeline that relies on single-person videos, enabling scalable multi-person talking video generation with limited data.

Findings

01 Achieves high lip synchronization and visual quality.

02 Supports arbitrary scaling of drivable identities.

03 Balances data costs with identity scalability.

Abstract

Recently, multi-person video generation has started to gain prominence. While a few preliminary works have explored audio-driven multi-person talking video generation, they often face challenges due to the high costs of diverse multi-person data collection and the difficulty of driving multiple identities with coherent interactivity. To address these challenges, we propose AnyTalker, a multi-person generation framework that features an extensible multi-stream processing architecture. Specifically, we extend Diffusion Transformer's attention block with a novel identity-aware attention mechanism that iteratively processes identity-audio pairs, allowing arbitrary scaling of drivable identities. Besides, training multi-person generative models demands massive multi-person data. Our proposed training pipeline depends solely on single-person videos to learn multi-person speaking patterns and…
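
The idea of iterating over identity-audio pairs can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the per-identity token masks, the additive residual update, and the sharing of one set of projection weights across identities are all assumptions made for illustration. It only shows why looping a shared audio cross-attention over pairs lets the number of drivable identities vary without changing the parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    # Single-head scaled dot-product cross-attention.
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def identity_aware_attention(video_tokens, audio_streams, identity_masks, weights):
    # Hypothetical sketch: loop over (audio, mask) pairs, reusing the same
    # weights, and add each audio-conditioned update only to the video tokens
    # assigned to that identity. The masking scheme is an assumption.
    out = video_tokens.copy()
    for audio, mask in zip(audio_streams, identity_masks):
        update = cross_attention(video_tokens, audio, *weights)
        out += mask[:, None] * update
    return out

# Toy example: 6 video tokens, 3 identities, 4 audio tokens per identity.
rng = np.random.default_rng(0)
dim = 8
video_tokens = rng.normal(size=(6, dim))
weights = tuple(rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))
audio_streams = [rng.normal(size=(4, dim)) for _ in range(3)]
identity_masks = [
    np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0]),  # identity 0 owns tokens 0-1
    np.array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0]),  # identity 1 owns tokens 2-3
    np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.0]),  # identity 2 owns token 4
]  # token 5 is background and belongs to no identity

result = identity_aware_attention(video_tokens, audio_streams, identity_masks, weights)
print(result.shape)
```

Because the loop reuses one set of weights, passing two pairs or five pairs requires no new parameters, which is one plausible reading of "arbitrary scaling of drivable identities". Tokens outside every mask, such as the background token here, pass through unchanged.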

Code & Models

Models

🤗 zzz66/AnyTalker-1.3B (model)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Music Technology and Sound Studies