Speaker Embedding Informed Audiovisual Active Speaker Detection for Egocentric Recordings

Jason Clarke; Yoshihiko Gotoh; Stefan Goetze

arXiv:2502.06012 · cs.MM · February 11, 2025

Open Access

TL;DR

This paper introduces SCAN, a speaker comparison network that enhances audiovisual active speaker detection in egocentric recordings by leveraging speaker embeddings from reference speech, improving accuracy in noisy, dynamic scenes.

Contribution

The paper presents SCAN, a novel auxiliary network utilizing speaker-specific information to disambiguate challenging scenes in active speaker detection, especially in egocentric recordings.
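The core idea of comparing speaker-specific information can be illustrated with a toy example. The sketch below is not SCAN itself, which is a learned auxiliary network; it only shows, under simplifying assumptions, how an embedding from a candidate's reference speech might be compared with an embedding from the current audio and blended with an audiovisual score. The function names, the cosine-similarity measure, the fusion weight, and the 4-dimensional vectors are all hypothetical choices made for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def fuse_scores(av_score: float, similarity: float, weight: float = 0.5) -> float:
    """Hypothetical late fusion (an assumption, not the paper's method):
    blend an audiovisual ASD score in [0, 1] with a speaker-similarity
    score that is first mapped from [-1, 1] to [0, 1]."""
    sim01 = (similarity + 1.0) / 2.0
    return (1.0 - weight) * av_score + weight * sim01


# Toy embeddings, invented for illustration only.
reference = [0.9, 0.1, 0.0, 0.4]       # from the candidate's reference speech
same_speaker = [0.8, 0.2, 0.1, 0.5]    # current audio, same speaker
other_speaker = [-0.3, 0.9, 0.6, -0.1] # current audio, different speaker

sim_same = cosine_similarity(reference, same_speaker)
sim_other = cosine_similarity(reference, other_speaker)

# With an uninformative audiovisual score (0.5), the speaker similarity
# alone separates the two cases.
fused_same = fuse_scores(0.5, sim_same)
fused_other = fuse_scores(0.5, sim_other)
```

In this toy setting the fused score is higher when the current audio matches the reference speaker, which mirrors the situation the paper targets: the visual signal gives no usable evidence, so speaker identity carries the decision.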

Findings

01. SCAN improves mAP by 14.5% with TalkNet

02. SCAN improves mAP by 10.3% with Light-ASD

03. Enhanced speaker embedding utilization boosts detection accuracy

Abstract

Audiovisual active speaker detection (ASD) addresses the task of determining the speech activity of a candidate speaker given acoustic and visual data. Typically, systems model the temporal correspondence of audiovisual cues, such as the synchronisation between speech and lip movement. Recent work has explored extending this paradigm by additionally leveraging speaker embeddings extracted from candidate speaker reference speech. This paper proposes the speaker comparison auxiliary network (SCAN) which uses speaker-specific information from both reference speech and the candidate audio signal to disambiguate challenging scenes when the visual signal is unresolvable. Furthermore, an improved method for enrolling face-speaker libraries is developed, which implements a self-supervised approach to video-based face recognition. Fitting with the recent proliferation of wearable devices, this…

Taxonomy

Topics: Speech and Audio Processing