Cross-modal Supervision for Learning Active Speaker Detection in Video
Punarjay Chakravarty, Tinne Tuytelaars

TL;DR
This paper introduces a weakly supervised method for active speaker detection in videos using audio cues to guide visual learning, enabling models to adapt to new speakers without extensive labeled data.
Contribution
It presents a novel audio-guided weak supervision approach for active speaker detection, including person-specific and online adaptation capabilities.
Findings
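Claims to be the first active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new one, using audio (VAD) as weak supervision
Temporal continuity compensates for the lack of clean training labels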
Demonstrates online adaptation to unseen speakers
Abstract
In this paper, we show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper body motion: the facial expressions and gesticulations associated with speaking. We further improve a generic model for active speaker detection by learning person-specific models. Finally, we demonstrate the online adaptation of generic models learnt on one dataset to previously unseen people in a new dataset, again using audio (VAD) for weak supervision. The use of temporal continuity overcomes the lack of clean training data. We are the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset. This work can be seen as an…
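The audio-to-vision supervision described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's method: the VAD is a plain energy threshold, temporal continuity is approximated by a sliding-window majority vote, the visual feature is a single hypothetical per-frame motion magnitude (the paper uses spatio-temporal upper-body features), and the classifier is a small logistic regression. All function names and toy values are hypothetical.

```python
import math

def vad_labels(frame_energies, threshold):
    """Assumed energy-threshold VAD: 1 if the audio frame looks like speech."""
    return [1 if e > threshold else 0 for e in frame_energies]

def smooth_labels(labels, window=3):
    """Majority vote over a centred window, a stand-in for temporal continuity."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        segment = labels[max(0, i - half):min(len(labels), i + half + 1)]
        smoothed.append(1 if 2 * sum(segment) > len(segment) else 0)
    return smoothed

def train_logistic(features, labels, lr=1.0, epochs=1000):
    """Fit a logistic classifier on weak labels by batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 (speaking) if the linear score is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Toy audio track: frame 3 is a spurious spike, frame 8 a brief dropout.
energies = [0.05, 0.04, 0.06, 0.90, 0.05, 0.03, 0.80, 0.85, 0.10, 0.90, 0.88, 0.82]
raw = vad_labels(energies, threshold=0.5)
weak = smooth_labels(raw, window=3)

# Hypothetical per-frame visual motion magnitudes for the same 12 frames.
motion = [[0.10], [0.20], [0.10], [0.20], [0.10], [0.15],
          [0.80], [0.90], [0.70], [0.85], [0.90], [0.80]]
w, b = train_logistic(motion, weak)

print(raw)   # noisy audio-derived labels
print(weak)  # labels after temporal smoothing
print(predict(w, b, [0.15]), predict(w, b, [0.75]))
```

The visual classifier never sees a hand-annotated label: it is trained only on the smoothed VAD output. Under the same assumptions, adapting to an unseen speaker would amount to continuing training from the generic `w, b` on weak labels from the new video.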
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
