UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios

Le Thien Phuc Nguyen; Zhuoran Yu; Khoa Quang Nhat Cao; Yuwei Guo; Tu Ho Manh Pham; Tuan Tai Nguyen; Toan Ngo Duc Vo; Lucas Poon; Soochahn Lee; Yong Jae Lee

arXiv:2505.21954 · cs.CV · May 29, 2025

TL;DR

UniTalk introduces a challenging, diverse dataset for active speaker detection that highlights the limitations of current models in real-world scenarios and promotes the development of more robust solutions.

Contribution

The paper presents UniTalk, a new large-scale dataset for active speaker detection emphasizing real-world challenges, and demonstrates its effectiveness in improving model generalization.

Findings

01. State-of-the-art models that score nearly perfectly on AVA fail to reach saturation on UniTalk.

02. Models trained on UniTalk generalize better to other in-the-wild datasets.

03. UniTalk sets a new benchmark for active speaker detection in realistic conditions.

Abstract

We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes - such as multiple visible speakers speaking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that…

Code & Models

Repositories

plnguyen2908/UniTalk-ASD-code
Official

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis