Cross-modal Supervision for Learning Active Speaker Detection in Video

Punarjay Chakravarty; Tinne Tuytelaars

arXiv:1603.08907 · cs.CV · March 30, 2016 · 1 citation

TL;DR

This paper introduces a weakly supervised method for active speaker detection in videos using audio cues to guide visual learning, enabling models to adapt to new speakers without extensive labeled data.

Contribution

It presents a novel audio-guided weak supervision approach for active speaker detection, including person-specific and online adaptation capabilities.

Findings

01. First to adapt models across datasets using audio supervision

02. Effective use of temporal continuity for training without clean labels

03. Demonstrates online adaptation to unseen speakers

Abstract

In this paper, we show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper body motion: the facial expressions and gesticulations associated with speaking. We further improve a generic model for active speaker detection by learning person-specific models. Finally, we demonstrate the online adaptation of generic models learnt on one dataset to previously unseen people in a new dataset, again using audio (VAD) for weak supervision. The use of temporal continuity overcomes the lack of clean training data. We are the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset. This work can be seen as an…
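The idea of audio-derived weak labels training a visual classifier can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the energy-threshold VAD, the majority-vote smoothing used as a stand-in for temporal continuity, the nearest-centroid classifier, the function names (`vad_labels`, `smooth_labels`, `fit_centroids`, `predict`), and the toy data.

```python
# Illustrative sketch only. The energy-threshold VAD, majority smoothing, and
# nearest-centroid classifier are hypothetical stand-ins, not the paper's method.

def vad_labels(frame_energy, threshold):
    """Hypothetical VAD: a frame counts as speech (1) if its audio energy exceeds threshold."""
    return [1 if e > threshold else 0 for e in frame_energy]

def smooth_labels(labels, window=3):
    """Majority vote over a sliding window, one simple way to exploit temporal continuity."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        segment = labels[lo:hi]
        smoothed.append(1 if 2 * sum(segment) > len(segment) else 0)
    return smoothed

def fit_centroids(features, labels):
    """Compute the mean visual feature vector per class from the noisy audio-derived labels."""
    centroids = {}
    for cls in (0, 1):
        rows = [f for f, y in zip(features, labels) if y == cls]
        if not rows:
            raise ValueError(f"no training frames for class {cls}")
        centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, feature):
    """Return the class whose centroid is nearest in squared Euclidean distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(feature, c))
    return min(centroids, key=lambda cls: dist(centroids[cls]))

# Toy data (made up): per-frame audio energy and a 1-D visual motion feature.
audio_energy = [0.1, 0.2, 0.9, 0.8, 0.1, 0.9, 0.85, 0.95, 0.1, 0.15]
motion_features = [[0.1], [0.2], [0.8], [0.9], [0.7], [0.9], [0.8], [0.85], [0.1], [0.2]]

# Audio supplies the labels; the classifier itself only ever sees visual features.
weak = smooth_labels(vad_labels(audio_energy, threshold=0.5))
model = fit_centroids(motion_features, weak)
```

In this toy run the smoothing step fills the single-frame dropout at index 4, so the visual classifier trains on a contiguous speaking segment. At test time `predict(model, feature)` needs no audio, which is the point of cross-modal supervision.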

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis