Unsupervised active speaker detection in media content using cross-modal information
Rahul Sharma, Shrikanth Narayanan

TL;DR
This paper introduces an unsupervised, cross-modal framework for detecting active speakers in media content by matching speech segments with facial images based on speaker identity, without requiring labeled data.
Contribution
The paper formulates active speaker detection as a speech-face assignment problem driven by speaker identity distances, accounts for off-screen speakers, and achieves competitive results on multiple datasets.
Findings
Performance competitive with supervised methods on benchmark datasets
Effective handling of off-screen speakers
Unsupervised approach reduces need for labeled data
Abstract
We present a cross-modal unsupervised framework for active speaker detection in media content such as TV shows and movies. Machine learning advances have enabled impressive performance in identifying individuals from speech and facial images. We leverage speaker identity information from speech and faces, and formulate active speaker detection as a speech-face assignment task such that the active speaker's face and the underlying speech identify the same person (character). We express the speech segments in terms of their associated speaker identity distances, from all other speech segments, to capture a relative identity structure for the video. Then we assign an active speaker's face to each speech segment from the concurrently appearing faces such that the obtained set of active speaker faces displays a similar relative identity structure. Furthermore, we propose a simple and…
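The assignment idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_distance_matrix` and `assign_faces`, the use of cosine distance, the absolute-difference mismatch score, and the greedy coordinate-wise update are all assumptions chosen for brevity. It also assumes that speech and face embeddings have already been extracted by some pretrained identity models, and it does not model off-screen speakers.

```python
import numpy as np


def cosine_distance_matrix(a, b):
    """Pairwise cosine distances between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T


def assign_faces(speech_emb, candidates, max_iters=20):
    """Pick one candidate face per speech segment (hypothetical helper).

    speech_emb: (n, d_s) array, one identity embedding per speech segment.
    candidates: list of n arrays; candidates[i] has shape (k_i, d_f) and
                holds embeddings of the faces visible during segment i.
    Returns a list of n indices into the candidate arrays.
    """
    n = len(candidates)
    choice = [0] * n  # assumed initialisation: first visible face
    if n < 2:
        return choice
    # Relative identity structure of the speech segments.
    speech_dist = cosine_distance_matrix(speech_emb, speech_emb)
    for _ in range(max_iters):
        changed = False
        for i in range(n):
            others = [j for j in range(n) if j != i]
            chosen = np.stack([candidates[j][choice[j]] for j in others])
            # Distances from each candidate face of segment i to the faces
            # currently assigned to every other segment: shape (k_i, n - 1).
            face_dist = cosine_distance_matrix(candidates[i], chosen)
            # Keep the candidate whose distance profile best matches
            # the speech distance profile of segment i.
            mismatch = np.abs(face_dist - speech_dist[i, others]).sum(axis=1)
            best = int(np.argmin(mismatch))
            if best != choice[i]:
                choice[i] = best
                changed = True
        if not changed:
            break
    return choice


# Toy example: two characters, A and B, with idealised embeddings.
sA, sB = np.array([1.0, 0.0]), np.array([0.0, 1.0])
fA, fB = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
speech = np.stack([sA, sB, sA, sA, sB])
faces = [
    np.stack([fA]),      # only A is on screen
    np.stack([fB]),      # only B is on screen
    np.stack([fA]),      # only A is on screen
    np.stack([fB, fA]),  # both on screen, A is speaking
    np.stack([fA, fB]),  # both on screen, B is speaking
]
result = assign_faces(speech, faces)
print(result)  # [0, 0, 0, 1, 1]
```

Segments with a single visible face anchor the solution in this toy example. Without such anchors, swapping every assignment between the two characters yields the same distance structure, and a greedy update of this kind can stall on ties, so the sketch should be read as an illustration of the matching criterion rather than a robust solver.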
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Speech Recognition and Synthesis
