UniTalk: Towards Universal Active Speaker Detection in Real World Scenarios
Le Thien Phuc Nguyen, Zhuoran Yu, Khoa Quang Nhat Cao, Yuwei Guo, Tu Ho Manh Pham, Tuan Tai Nguyen, Toan Ngo Duc Vo, Lucas Poon, Soochahn Lee, Yong Jae Lee

TL;DR
UniTalk is a challenging, diverse dataset for active speaker detection that exposes the limitations of current models in real-world scenarios and is intended to drive the development of more robust ones.
Contribution
The paper presents UniTalk, a new large-scale dataset for active speaker detection emphasizing real-world challenges, and demonstrates its effectiveness in improving model generalization.
Findings
State-of-the-art models that achieve nearly perfect scores on AVA fail to reach saturation on UniTalk.
Models trained on UniTalk generalize better to other in-the-wild datasets.
UniTalk sets a new benchmark for active speaker detection in realistic conditions.
Abstract
We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes, such as multiple visible speakers speaking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face Recognition and Analysis
