Unsupervised active speaker detection in media content using cross-modal information

Rahul Sharma; Shrikanth Narayanan

arXiv:2209.11896 · eess.IV · September 27, 2022 · 1 citation

TL;DR

This paper introduces an unsupervised, cross-modal framework for detecting active speakers in media content by matching speech segments with facial images based on speaker identity, without requiring labeled data.

Contribution

The paper formulates active speaker detection as a speech-face assignment problem based on speaker identity distances, accounts for off-screen speakers, and achieves competitive results on multiple datasets.

Findings

01. Performance competitive with supervised methods on benchmark datasets
02. Effective handling of off-screen speakers
03. Unsupervised approach reduces the need for labeled data

Abstract

We present a cross-modal unsupervised framework for active speaker detection in media content such as TV shows and movies. Machine learning advances have enabled impressive performance in identifying individuals from speech and facial images. We leverage speaker identity information from speech and faces, and formulate active speaker detection as a speech-face assignment task such that the active speaker's face and the underlying speech identify the same person (character). We express the speech segments in terms of their associated speaker identity distances, from all other speech segments, to capture a relative identity structure for the video. Then we assign an active speaker's face to each speech segment from the concurrently appearing faces such that the obtained set of active speaker faces displays a similar relative identity structure. Furthermore, we propose a simple and…
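
The assignment idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`pairwise_dist`, `structure_score`, `assign_faces`), the use of Euclidean distance, Pearson correlation as the similarity between the two identity structures, and the brute-force search are all assumptions made for illustration. The embeddings are hand-made stand-ins for the outputs of speaker and face recognition models, and off-screen speakers are not modelled.

```python
import itertools
import numpy as np

def pairwise_dist(x):
    # Euclidean distance between every pair of rows (assumed metric).
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def structure_score(d_speech, d_face):
    # Assumed similarity: Pearson correlation of the upper triangles
    # of the speech and face distance matrices.
    iu = np.triu_indices_from(d_speech, k=1)
    a, b = d_speech[iu], d_face[iu]
    if a.std() == 0 or b.std() == 0:
        return 0.0  # correlation is undefined for constant vectors
    return float(np.corrcoef(a, b)[0, 1])

def assign_faces(speech_emb, face_candidates):
    # Brute force over all face choices; only feasible for toy inputs.
    d_speech = pairwise_dist(np.asarray(speech_emb, dtype=float))
    best_choice, best_score = None, -np.inf
    index_ranges = (range(len(c)) for c in face_candidates)
    for choice in itertools.product(*index_ranges):
        faces = np.array(
            [face_candidates[i][j] for i, j in enumerate(choice)],
            dtype=float,
        )
        score = structure_score(d_speech, pairwise_dist(faces))
        if score > best_score:
            best_choice, best_score = list(choice), score
    return best_choice, best_score

# Toy data: two characters, A and B, speaking in the order A, B, A, B.
f_a, f_b = [0.0, 0.0], [3.0, 4.0]           # hypothetical face embeddings
speech = [[0.0], [1.0], [0.0], [1.0]]       # hypothetical speech embeddings
candidates = [[f_a], [f_a, f_b], [f_b, f_a], [f_a, f_b]]  # faces on screen

choice, score = assign_faces(speech, candidates)
print(choice, score)
```

In this toy case the first segment shows only one face, which rules out the mirror-image solution where every speech segment is paired with the wrong character. A search over real videos would need something more scalable than exhaustive enumeration, since the number of combinations grows exponentially with the number of speech segments.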

Code & Models

Repositories

rash1993/movie-asd
pytorch

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Speech Recognition and Synthesis