Audio-Visual Activity Guided Cross-Modal Identity Association for Active Speaker Detection

Rahul Sharma; Shrikanth Narayanan

arXiv:2212.00539·cs.MM·December 2, 2022

TL;DR

This paper introduces an unsupervised framework that combines audio-visual activity cues and speaker identity information to improve active speaker detection in videos, addressing limitations of individual methods.

Contribution

It proposes a novel late fusion approach that leverages the complementary strengths of activity-based and identity-based methods for better speaker detection.
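As a rough illustration of what late fusion means here, the sketch below combines two per-face scores for a single speech segment: one from an activity-based model and one from an identity-based model. It is not the authors' implementation. The function names, the convex-combination rule, the `weight` parameter, and the `threshold` are all assumptions made for this example.

```python
from typing import Dict, Optional


def late_fusion(
    activity: Dict[str, float],
    identity: Dict[str, float],
    weight: float = 0.5,
) -> Dict[str, float]:
    """Fuse two per-face scores with a convex combination.

    `activity` and `identity` map a face-track id to a score in [0, 1].
    `weight` is the share given to the activity score (hypothetical
    parameter, not taken from the paper).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    # Only faces scored by both models can be fused.
    shared = activity.keys() & identity.keys()
    return {
        face: weight * activity[face] + (1.0 - weight) * identity[face]
        for face in shared
    }


def pick_active_speaker(
    fused: Dict[str, float], threshold: float = 0.5
) -> Optional[str]:
    """Return the highest-scoring face, or None if no face reaches the threshold."""
    if not fused:
        return None
    best = max(fused, key=fused.get)
    return best if fused[best] >= threshold else None


# Toy scores: face_a is laughing, so the activity model rates it highly,
# while the identity model favours face_b.
activity_scores = {"face_a": 0.9, "face_b": 0.2}
identity_scores = {"face_a": 0.3, "face_b": 0.8}

fused_scores = late_fusion(activity_scores, identity_scores, weight=0.25)
speaker = pick_active_speaker(fused_scores)
```

With these toy numbers the identity score outweighs the misleading activity score, so `speaker` is `face_b`. In practice the weighting would need to be chosen or learned for the data at hand.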

Findings

1. Fusion improves detection accuracy on benchmark datasets.
2. Combining modalities reduces confusion with non-speech vocal activities.
3. Unsupervised framework eliminates need for extensive labeled data.

Abstract

Active speaker detection in videos addresses associating a source face, visible in the video frames, with the underlying speech in the audio modality. The two primary sources of information to derive such a speech-face relationship are i) visual activity and its interaction with the speech signal and ii) co-occurrences of speakers' identities across modalities in the form of face and speech. The two approaches have their limitations: the audio-visual activity models get confused with other frequently occurring vocal activities, such as laughing and chewing, while the speakers' identity-based methods are limited to videos having enough disambiguating information to establish a speech-face association. Since the two approaches are independent, we investigate their complementary nature in this work. We propose a novel unsupervised framework to guide the speakers' cross-modal identity…

Code & Models

Repositories

rash1993/movie-asd (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Subtitles and Audiovisual Media · Music and Audio Processing