Speaker Embedding Informed Audiovisual Active Speaker Detection for Egocentric Recordings
Jason Clarke, Yoshihiko Gotoh, Stefan Goetze

TL;DR
This paper introduces SCAN, a speaker comparison network that enhances audiovisual active speaker detection in egocentric recordings by leveraging speaker embeddings from reference speech, improving accuracy in noisy, dynamic scenes.
Contribution
The paper presents SCAN, a novel auxiliary network utilizing speaker-specific information to disambiguate challenging scenes in active speaker detection, especially in egocentric recordings.
Findings
SCAN improves mAP by 14.5% with TalkNet
SCAN improves mAP by 10.3% with Light-ASD
Using speaker embeddings from both the reference speech and the candidate audio signal improves detection accuracy
Abstract
Audiovisual active speaker detection (ASD) addresses the task of determining the speech activity of a candidate speaker given acoustic and visual data. Typically, systems model the temporal correspondence of audiovisual cues, such as the synchronisation between speech and lip movement. Recent work has explored extending this paradigm by additionally leveraging speaker embeddings extracted from candidate speaker reference speech. This paper proposes the speaker comparison auxiliary network (SCAN) which uses speaker-specific information from both reference speech and the candidate audio signal to disambiguate challenging scenes when the visual signal is unresolvable. Furthermore, an improved method for enrolling face-speaker libraries is developed, which implements a self-supervised approach to video-based face recognition. Fitting with the recent proliferation of wearable devices, this…
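The core idea in the abstract, comparing speaker-specific information from reference speech with the candidate audio signal, can be illustrated with a minimal sketch. This is not the SCAN architecture, which is a learned auxiliary network whose details are not given here. It only assumes that some speaker encoder has already produced fixed-length embeddings, and it uses cosine similarity as a stand-in comparison. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def speaker_match_scores(reference_embedding, candidate_embeddings):
    """Score each candidate audio segment against the enrolled reference speaker.

    Assumption: both inputs come from the same (hypothetical) speaker encoder,
    so their embeddings live in the same vector space.
    """
    return [cosine_similarity(reference_embedding, e) for e in candidate_embeddings]


# Toy 3-dimensional embeddings; real speaker embeddings are much larger.
reference = [1.0, 0.0, 0.0]
segments = [
    [2.0, 0.0, 0.0],  # same direction as the reference
    [0.0, 3.0, 0.0],  # orthogonal to the reference
    [1.0, 1.0, 0.0],  # partially aligned
]
scores = speaker_match_scores(reference, segments)
print(scores)
```

A score near 1 suggests that the audio segment matches the enrolled speaker, which is the kind of cue that could help when lip movement cannot be resolved visually. In a full system such a score would be one input among several rather than a decision on its own.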
Taxonomy
Topics: Speech and Audio Processing
