Ensembling Synchronisation-based and Face-Voice Association Paradigms for Robust Active Speaker Detection in Egocentric Recordings
Jason Clarke, Yoshihiko Gotoh, Stefan Goetze

TL;DR
This paper introduces an ensemble method combining synchronisation-based and face-voice association models to improve active speaker detection in egocentric videos, addressing challenges like occlusion and motion blur.
Contribution
It proposes a simple weighted averaging ensemble of two complementary models and a refined preprocessing pipeline for face-voice association, enhancing robustness in challenging conditions.
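The weighted averaging described above can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function name `ensemble_scores`, the assumption that both models emit per-frame speaking scores in [0, 1] for the same face track, and the default weight of 0.5 are all hypothetical and would need to be adapted (and the weight tuned on validation data) for any real system.

```python
def ensemble_scores(sync_scores, fva_scores, weight=0.5):
    """Fuse two lists of per-frame speaking scores by weighted averaging.

    sync_scores: scores from a synchronisation-based model (assumed in [0, 1]).
    fva_scores:  scores from a face-voice association model (assumed in [0, 1]).
    weight:      hypothetical weight on the synchronisation-based model;
                 the remaining (1 - weight) goes to the FVA model.
    """
    if len(sync_scores) != len(fva_scores):
        raise ValueError("both models must score the same number of frames")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [
        weight * s + (1.0 - weight) * f
        for s, f in zip(sync_scores, fva_scores)
    ]


# Illustrative (made-up) scores for three frames of one face track.
sync = [0.9, 0.2, 0.1]
fva = [0.6, 0.7, 0.2]
fused = ensemble_scores(sync, fva, weight=0.5)
```

Because the fusion is a convex combination, the fused score for each frame always lies between the two input scores, so a model degraded by occlusion or overlapping speech can be partially offset by the other without any additional trainable fusion layers.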
Findings
The ensemble achieves 70.2% mAP with a TalkNet backbone.
The ensemble achieves 66.7% mAP with a Light-ASD backbone.
The ensemble is more robust than either individual model.
Abstract
Audiovisual active speaker detection (ASD) in egocentric recordings is challenged by frequent occlusions, motion blur, and audio interference, which undermine the discernability of temporal synchrony between lip movement and speech. Traditional synchronisation-based systems perform well under clean conditions but degrade sharply in first-person recordings. Conversely, face-voice association (FVA)-based methods forgo synchronisation modelling in favour of cross-modal biometric matching, exhibiting robustness to transient visual corruption but suffering when overlapping speech or front-end segmentation errors occur. In this paper, a simple yet effective ensemble approach is proposed to fuse synchronisation-dependent and synchronisation-agnostic model outputs via weighted averaging, thereby harnessing complementary cues without introducing complex fusion architectures. A refined…
