TL;DR
This paper introduces SL-ASD, a novel face-voice association framework for audiovisual active speaker detection in egocentric recordings, outperforming traditional synchronisation-based methods under challenging conditions.
Contribution
The work presents a new system that relies solely on face-voice associations, reducing dependence on audiovisual synchronisation, and demonstrates its effectiveness in egocentric scenarios.
Findings
Achieves performance comparable to or better than synchronisation-based methods.
Uses fewer learnable parameters, increasing efficiency.
Validates face-voice association as a viable alternative in challenging conditions.
Abstract
Audiovisual active speaker detection (ASD) is conventionally performed by modelling the temporal synchronisation of acoustic and visual speech cues. In egocentric recordings, however, the efficacy of synchronisation-based methods is compromised by occlusions, motion blur, and adverse acoustic conditions. In this work, a novel framework is proposed that exclusively leverages cross-modal face-voice associations to determine speaker activity. An existing face-voice association model is integrated with a transformer-based encoder that aggregates facial identity information by dynamically weighting each frame based on its visual quality. This system is then coupled with a front-end utterance segmentation method, producing a complete ASD system. This work demonstrates that the proposed system, Self-Lifting for audiovisual active speaker detection (SL-ASD), achieves performance comparable to,…
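The quality-weighted aggregation described in the abstract can be illustrated with a small sketch. This is not the SL-ASD implementation: the paper uses a transformer-based encoder, whereas the sketch below substitutes a plain softmax over per-frame quality scores. The function names, the quality scores, and the decision threshold are all hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_face_embedding(frame_embeddings, quality_scores):
    """Pool per-frame face embeddings into one identity vector.

    Assumption: each frame has a scalar quality score (hypothetical input);
    higher scores give that frame more weight in the pooled embedding.
    """
    weights = softmax(quality_scores)
    dim = len(frame_embeddings[0])
    return [
        sum(w * emb[i] for w, emb in zip(weights, frame_embeddings))
        for i in range(dim)
    ]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_active_speaker(face_embedding, voice_embedding, threshold=0.5):
    """Declare the face active if it matches the voice of the utterance.

    Assumption: face and voice embeddings live in a shared space, and the
    threshold value is an arbitrary placeholder.
    """
    return cosine_similarity(face_embedding, voice_embedding) >= threshold
```

In this sketch, a blurred or occluded frame with a low quality score contributes little to the pooled identity vector, so the face-voice comparison is dominated by the clearer frames of the track.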
