EvalTalker: Learning to Evaluate Real-Portrait-Driven Multi-Subject Talking Humans

Yingjie Zhou; Xilei Zhu; Siyu Ren; Ziyi Zhao; Ziwen Wang; Farong Wen; Yu Zhou; Jiezhang Cao; Xiongkuo Min; Fengjiao Chen; Xiaoyu Li; Xuezhi Cao; Guangtao Zhai; Xiaohong Liu

arXiv:2512.01340 · cs.CV · December 2, 2025

TL;DR

This paper introduces EvalTalker, a framework for assessing the quality of multi-subject talking human videos. It builds on THQA-MT, a large-scale dataset of Multi-Talker-generated videos, and on a perceptual analysis of the distortions that current Multi-Talkers produce.

Contribution

The paper presents THQA-MT, the first large-scale Multi-Talker-generated Talking Human Quality Assessment dataset, and EvalTalker, an evaluation framework whose predictions correlate well with human subjective scores.
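As an illustration of what "correlates well with human subjective scores" can mean in practice, the sketch below computes two correlation measures that are common in quality assessment: Pearson linear correlation and Spearman rank correlation. This is a minimal sketch under stated assumptions, not the paper's evaluation code. The page does not say which correlation measures the authors report, and the function names and the score lists are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: mean opinion scores from viewers and the scores
# predicted by some quality metric for the same five videos.
subjective = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.9, 3.5, 3.4, 4.8]

print(f"Pearson:  {pearson(subjective, predicted):.3f}")
print(f"Spearman: {spearman(subjective, predicted):.3f}")
```

Spearman correlation depends only on how the metric orders the videos, while Pearson correlation also reflects how linearly the predicted values track the subjective ones, so the two are often reported together.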

Findings

01. EvalTalker outperforms existing methods in correlation with subjective scores.

02. Identified 12 common distortion types affecting multi-talker quality.

03. Constructed a dataset of 5,492 multi-talker generated talking humans.

Abstract

Speech-driven Talking Human (TH) generation, commonly known as "Talker," currently faces limitations in multi-subject driving capabilities. Extending this paradigm to "Multi-Talker," capable of animating multiple subjects simultaneously, introduces richer interactivity and stronger immersion in audiovisual communication. However, current Multi-Talkers still exhibit noticeable quality degradation caused by technical limitations, resulting in suboptimal user experiences. To address this challenge, we construct THQA-MT, the first large-scale Multi-Talker-generated Talking Human Quality Assessment dataset, consisting of 5,492 Multi-Talker-generated THs (MTHs) from 15 representative Multi-Talkers using 400 real portraits collected online. Through subjective experiments, we analyze perceptual discrepancies among different Multi-Talkers and identify 12 common types of distortion. Furthermore,…

Taxonomy

Topics: Speech and Audio Processing · Social Robot Interaction and HRI · Emotion and Mood Recognition