UniCon: Unified Context Network for Robust Active Speaker Detection
Yuanhang Zhang, Susan Liang, Shuang Yang, Xiao Liu, Zhongqin Wu, Shiguang Shan, Xilin Chen

TL;DR
UniCon is a unified framework for active speaker detection that jointly models spatial, relational, and temporal context, significantly improving accuracy, especially in challenging scenarios with multiple candidates or low-resolution faces.
Contribution
The paper introduces UniCon, a novel unified model that jointly captures spatial, relational, and temporal contexts for robust active speaker detection, outperforming previous methods.
Findings
Outperforms the state of the art by about 15% mAP on challenging ASD benchmarks.
Achieves 92.0% mAP on the AVA-ActiveSpeaker dataset, exceeding the 90% mark.
Effectively handles multiple candidates and low-resolution faces.
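
The findings above are reported in mean Average Precision (mAP). As a rough illustration of what the metric measures, here is a minimal sketch of non-interpolated average precision for a single "speaking" class. It is an assumption-laden toy, not the official AVA-ActiveSpeaker evaluation tool, and the function name `average_precision` is hypothetical.

```python
import numpy as np

def average_precision(labels, scores):
    """Non-interpolated AP: mean of precision at each rank where a positive occurs.

    labels: 1 for "speaking", 0 otherwise (illustrative convention).
    scores: predicted speaking scores, higher means more confident.
    """
    labels = np.asarray(labels, dtype=int)
    scores = np.asarray(scores, dtype=float)
    # Rank predictions from most to least confident.
    order = np.argsort(-scores, kind="stable")
    hits = labels[order]
    if hits.sum() == 0:
        return 0.0  # no positives: AP is undefined, return 0 by convention here
    # Precision after each ranked prediction.
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Average the precision values at the ranks of the positives.
    return float((precision_at_k * hits).sum() / hits.sum())
```

With a single class, mAP reduces to this one AP value; a gain of "15% mAP" refers to a difference in this score expressed in percentage points.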
Abstract
We introduce a new efficient framework, the Unified Context Network (UniCon), for robust active speaker detection (ASD). Traditional methods for ASD usually operate on each candidate's pre-cropped face track separately and do not sufficiently consider the relationships among the candidates. This potentially limits performance, especially in challenging scenarios with low-resolution faces, multiple candidates, etc. Our solution is a novel, unified framework that focuses on jointly modeling multiple types of contextual information: spatial context to indicate the position and scale of each candidate's face, relational context to capture the visual relationships among the candidates and contrast audio-visual affinities with each other, and temporal context to aggregate long-term information and smooth out local uncertainties. Based on such information, our model optimizes all candidates in…
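
To make the three context types in the abstract more concrete, the following is a minimal NumPy sketch under stated assumptions. It is not the authors' implementation: the function names, the box encoding, the "subtract the mean of the other candidates" contrast, and the moving-average smoother are all hypothetical stand-ins chosen for illustration, whereas the paper's model learns these components.

```python
import numpy as np

def spatial_context(boxes, frame_w, frame_h):
    """Encode each face box (x1, y1, x2, y2) as normalized center and size.

    Assumption: position and scale are summarized as (cx, cy, w, h) in [0, 1].
    """
    boxes = np.asarray(boxes, dtype=float)
    cx = (boxes[:, 0] + boxes[:, 2]) / 2 / frame_w
    cy = (boxes[:, 1] + boxes[:, 3]) / 2 / frame_h
    w = (boxes[:, 2] - boxes[:, 0]) / frame_w
    h = (boxes[:, 3] - boxes[:, 1]) / frame_h
    return np.stack([cx, cy, w, h], axis=1)

def relational_contrast(affinity):
    """Contrast each candidate's audio-visual affinity with the other candidates.

    affinity: (T, N) array, one score per frame and candidate (hypothetical input).
    Returns each score minus the mean score of the remaining candidates.
    """
    affinity = np.asarray(affinity, dtype=float)
    n = affinity.shape[1]
    if n == 1:
        return np.zeros_like(affinity)  # nothing to contrast against
    others_mean = (affinity.sum(axis=1, keepdims=True) - affinity) / (n - 1)
    return affinity - others_mean

def temporal_smooth(scores, window=5):
    """Moving average over time for each candidate; a simple stand-in for
    learned temporal aggregation. scores: (T, N); window must be odd."""
    assert window % 2 == 1, "this sketch assumes an odd window"
    scores = np.asarray(scores, dtype=float)
    kernel = np.ones(window) / window
    half = window // 2
    # Repeat edge frames so the output keeps the same length T.
    padded = np.pad(scores, ((half, half), (0, 0)), mode="edge")
    cols = [np.convolve(padded[:, j], kernel, mode="valid")
            for j in range(scores.shape[1])]
    return np.stack(cols, axis=1)
```

The point of the sketch is the data flow: per-candidate position and scale features, scores that are compared across candidates in the same frame rather than judged in isolation, and a temporal pass that smooths out frame-level noise.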
