Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection
Rahul Sharma, Shrikanth Narayanan

TL;DR
This paper introduces an unsupervised framework that combines audio-visual activity cues and speaker identity information to improve active speaker detection in videos, addressing limitations of individual methods.
Contribution
It proposes a novel late fusion approach that leverages the complementary strengths of activity-based and identity-based methods for better speaker detection.
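As a rough illustration of what late fusion means here, the sketch below combines two per-face score arrays after each has been computed independently. The weighted average, the `weight` and `threshold` parameters, and the function names `late_fuse` and `pick_active_speaker` are all assumptions made for this example; the paper's actual fusion rule is not given in this summary.

```python
import numpy as np

def late_fuse(activity_scores, identity_scores, weight=0.5):
    """Weighted average of two per-face score arrays.

    This fusion rule is an illustrative assumption, not the paper's method.
    """
    activity = np.asarray(activity_scores, dtype=float)
    identity = np.asarray(identity_scores, dtype=float)
    if activity.shape != identity.shape:
        raise ValueError("score arrays must have the same shape")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * activity + (1.0 - weight) * identity

def pick_active_speaker(fused_scores, threshold=0.5):
    """Return the index of the top-scoring face, or None if none reaches threshold."""
    fused = np.asarray(fused_scores, dtype=float)
    best = int(np.argmax(fused))
    return best if fused[best] >= threshold else None

# Toy scores for three visible faces in one frame (made-up numbers).
activity = [0.9, 0.8, 0.1]  # face 1 is laughing, so activity alone is ambiguous
identity = [0.9, 0.2, 0.3]  # identity association favours face 0
fused = late_fuse(activity, identity, weight=0.5)
speaker = pick_active_speaker(fused)
```

In this toy frame the activity scores alone barely separate faces 0 and 1, while the identity scores pull them apart, which is the complementary behaviour the contribution describes.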
Findings
Fusion improves detection accuracy on benchmark datasets.
Combining modalities reduces confusion with non-speech vocal activities.
The unsupervised framework eliminates the need for extensive labeled data.
Abstract
Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality. The two primary sources of information to derive such a speech-face relationship are i) visual activity and its interaction with the speech signal and ii) co-occurrences of speakers' identities across modalities in the form of face and speech. The two approaches have their limitations: the audio-visual activity models get confused with other frequently occurring vocal activities, such as laughing and chewing, while the speakers' identity-based methods are limited to videos having enough disambiguating information to establish a speech-face association. Since the two approaches are independent, we investigate their complementary nature in this work. We propose a novel unsupervised framework to guide the speakers' cross-modal identity…
Taxonomy
Topics: Speech and Audio Processing · Subtitles and Audiovisual Media · Music and Audio Processing
