EvalTalker: Learning to Evaluate Real-Portrait-Driven Multi-Subject Talking Humans
Yingjie Zhou, Xilei Zhu, Siyu Ren, Ziyi Zhao, Ziwen Wang, Farong Wen, Yu Zhou, Jiezhang Cao, Xiongkuo Min, Fengjiao Chen, Xiaoyu Li, Xuezhi Cao, Guangtao Zhai, Xiaohong Liu

TL;DR
This paper introduces EvalTalker, a framework for assessing the quality of multi-subject talking human videos. It targets the noticeable quality degradation of current Multi-Talker generators and builds on THQA-MT, a large-scale dataset annotated through subjective experiments.
Contribution
It presents THQA-MT, the first large-scale Multi-Talker-generated Talking Human Quality Assessment dataset, and EvalTalker, an evaluation framework whose predictions correlate well with human subjective scores.
Findings
EvalTalker outperforms existing methods in correlation with subjective scores.
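Subjective experiments identify 12 common distortion types that affect Multi-Talker output quality.
THQA-MT contains 5,492 Multi-Talker-generated talking humans, produced by 15 representative Multi-Talkers from 400 real portraits collected online.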
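"Correlation with subjective scores" is usually measured in quality assessment with rank and linear correlation coefficients (commonly abbreviated SRCC and PLCC). The summary here does not say which coefficients EvalTalker reports, so the following is only a minimal sketch of how such a comparison is typically computed. The score lists are hypothetical, not values from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical example: mean opinion scores from viewers and the
# scores a quality metric predicted for the same five videos.
mos = [1.8, 2.5, 3.1, 3.9, 4.6]
predicted = [0.21, 0.35, 0.33, 0.70, 0.88]

srcc = spearman(predicted, mos)  # agreement in ordering
plcc = pearson(predicted, mos)   # agreement in linear trend
print(f"SRCC={srcc:.3f}  PLCC={plcc:.3f}")
```

A metric that ranks videos in the same order as human viewers scores close to 1 on the rank coefficient even if its raw scale differs from the opinion scale, which is why both coefficients are usually reported together.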
Abstract
Speech-driven Talking Human (TH) generation, commonly known as "Talker," currently faces limitations in multi-subject driving capabilities. Extending this paradigm to "Multi-Talker," capable of animating multiple subjects simultaneously, introduces richer interactivity and stronger immersion in audiovisual communication. However, current Multi-Talkers still exhibit noticeable quality degradation caused by technical limitations, resulting in suboptimal user experiences. To address this challenge, we construct THQA-MT, the first large-scale Multi-Talker-generated Talking Human Quality Assessment dataset, consisting of 5,492 Multi-Talker-generated THs (MTHs) from 15 representative Multi-Talkers using 400 real portraits collected online. Through subjective experiments, we analyze perceptual discrepancies among different Multi-Talkers and identify 12 common types of distortion. Furthermore,…
Code & Models
- 🤗 vantagewithai/LongCat-Video-Avatar-ComfyUI-GGUF · model · 4.3k dl · ♡ 17
- 🤗 meituan-longcat/LongCat-Video-Avatar · model · 263 dl · ♡ 229
- 🤗 fjkane/LongCat-Video-Avatar-bf16 · model · 2 dl · ♡ 2
- 🤗 Frederic75/LongCat-Video-Avatar-ComfyUI-GGUF · model · 452 dl
- 🤗 krapiunitski/longcat-avatar-weights · model · 20 dl
Taxonomy
Topics: Speech and Audio Processing · Social Robot Interaction and HRI · Emotion and Mood Recognition
