Rethinking Audio-visual Synchronization for Active Speaker Detection
Abudukelimu Wuerkaixi, You Zhang, Zhiyao Duan, Changshui Zhang

TL;DR
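This paper redefines active speaker detection (ASD) to require synchronization between the audio and the visible speaking activity. It proposes cross-modal contrastive learning, together with positional encoding in attention modules, so that supervised ASD models can use the synchronization cue and stop labeling unsynchronized videos as active speaking.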
Contribution
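The paper clarifies the definition of ASD by requiring synchronization between audio and visual speaking activity, and proposes a cross-modal contrastive learning strategy, with positional encoding applied in attention modules, so that supervised ASD models can exploit the synchronization cue.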
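As an illustration of the general idea behind cross-modal contrastive learning, here is a minimal sketch of an InfoNCE-style loss over per-frame audio and visual embeddings. It is not the authors' implementation: the loss form, the function names (`cosine`, `info_nce`), the temperature value, and the toy embeddings are all assumptions made for this example. The temporally aligned pair `(audio[t], visual[t])` is treated as the positive, and every misaligned frame as a negative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def info_nce(audio, visual, temperature=0.1):
    """Mean InfoNCE loss; audio[t] and visual[t] are the positive pair (assumed form)."""
    n = len(audio)
    total = 0.0
    for t in range(n):
        # Similarity of audio frame t to every visual frame in the clip.
        logits = [cosine(audio[t], visual[s]) / temperature for s in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the aligned frame.
        total += log_denom - logits[t]
    return total / n

# Toy embeddings (hypothetical): three frames, one-hot per frame.
visual = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
synced = [row[:] for row in visual]   # audio aligned with video
shifted = visual[1:] + visual[:1]     # audio shifted by one frame

loss_synced = info_nce(synced, visual)
loss_shifted = info_nce(shifted, visual)
```

In this toy setup the aligned clip yields a much lower loss than the shifted one, which is the behavior a synchronization-aware objective is meant to encourage.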
Findings
Existing ASD models often misclassify unsynchronized videos as active speaking.
The proposed method effectively detects unsynchronized speaking, reducing false positives.
Experimental results show improved accuracy over previous approaches.
Abstract
Active speaker detection (ASD) systems are important modules for analyzing multi-talker conversations. They aim to detect which speakers or none are talking in a visual scene at any given time. Existing research on ASD does not agree on the definition of active speakers. We clarify the definition in this work and require synchronization between the audio and visual speaking activities. This clarification of definition is motivated by our extensive experiments, through which we discover that existing ASD methods fail in modeling the audio-visual synchronization and often classify unsynchronized videos as active speaking. To address this problem, we propose a cross-modal contrastive learning strategy and apply positional encoding in attention modules for supervised ASD models to leverage the synchronization cue. Experimental results suggest that our model can successfully detect…
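The abstract does not say which positional encoding is applied in the attention modules. As a hedged illustration only, the sketch below uses the standard sinusoidal encoding from the original Transformer, which is an assumption here and not necessarily the authors' choice. The function names (`sinusoidal_encoding`, `add_positions`) and the toy features are hypothetical. It shows why position information matters for synchronization: attention without it has no built-in notion of frame order.

```python
import math

def sinusoidal_encoding(num_frames, dim):
    """Standard sinusoidal position table of shape [num_frames][dim]."""
    table = []
    for pos in range(num_frames):
        row = []
        for i in range(dim):
            # Each sin/cos pair shares one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(features):
    """Add the position table to per-frame features before attention."""
    pe = sinusoidal_encoding(len(features), len(features[0]))
    return [[x + p for x, p in zip(frame, row)] for frame, row in zip(features, pe)]

# Three identical frame features (hypothetical) become distinguishable by position.
frames = [[0.5, 0.5, 0.5, 0.5] for _ in range(3)]
encoded = add_positions(frames)
```

After encoding, frames with identical content differ by their position terms, so a downstream attention layer can in principle tell whether audio and visual features come from the same time step.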
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Video Analysis and Summarization
Methods: Contrastive Learning
