Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition

Kalin Stefanov; Jonas Beskow; Giampiero Salvi

arXiv:1711.08992 · cs.CV · July 19, 2019

TL;DR

This paper introduces a self-supervised method that detects active speakers in multi-person social interactions from visual input alone. Audio supplies the training signal in place of external annotations, and the detector is intended to support language acquisition systems in social settings.

Contribution

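The paper presents a self-supervised approach that detects active speakers from visual information about their faces. Auditory information is used only to support learning in the visual domain, so the method needs no external labels, which makes it suitable for social and cognitive systems.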

Findings

01. Good performance in speaker-dependent settings
02. Lower performance in speaker-independent scenarios
03. Potential as a component for social robots

Abstract

This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their face. Furthermore, the method does not rely on external annotations, thus complying with cognitive development. Instead, the method uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method using a large multi-person face-to-face interaction…
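The idea of letting the auditory modality supervise learning in the visual domain can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' architecture or data: the synthetic frames, the energy threshold used as a stand-in for acoustic speech detection, the two hypothetical visual features, and the logistic regression classifier. The point is only that the visual classifier is trained on labels derived from audio, never on human annotations.

```python
import numpy as np

def audio_pseudo_labels(energy, threshold):
    """Mark a frame as 'speaking' (1.0) when its audio energy exceeds the threshold."""
    return (energy > threshold).astype(np.float64)

def train_visual_classifier(features, labels, lr=0.5, epochs=1000):
    """Fit a logistic regression on visual features with plain gradient descent."""
    n_frames, n_dims = features.shape
    w = np.zeros(n_dims)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))  # predicted speaking probability
        grad = p - labels                               # gradient of the log loss w.r.t. the logit
        w -= lr * (features.T @ grad) / n_frames
        b -= lr * grad.mean()
    return w, b

def predict_speaking(features, w, b):
    """Return 1 for frames the visual classifier considers 'speaking', else 0."""
    return (features @ w + b > 0).astype(int)

# Synthetic data (assumption): a hidden speaking state drives both modalities.
rng = np.random.default_rng(0)
n_frames = 400
speaking = rng.integers(0, 2, n_frames)                    # used only to synthesize data
energy = speaking + rng.normal(0.0, 0.1, n_frames)         # toy audio energy per frame
visual = np.column_stack([
    speaking + rng.normal(0.0, 0.3, n_frames),             # hypothetical mouth-motion feature
    rng.normal(0.0, 1.0, n_frames),                        # hypothetical uninformative feature
])

# Self-supervision: labels come from the audio, the classifier sees only visual features.
labels = audio_pseudo_labels(energy, threshold=0.5)
w, b = train_visual_classifier(visual, labels)
pred = predict_speaking(visual, w, b)
agreement = (pred == labels).mean()
print(f"visual predictions agree with audio-derived labels on {agreement:.0%} of frames")
```

Once trained, `predict_speaking` needs no audio at all, which mirrors the stated goal of complementing acoustic detection in noisy conditions. Running one such classifier per visible face would also allow several overlapping speakers to be flagged at once.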
