Audio-Visual Speaker Diarization: Current Databases, Approaches and Challenges

Victoria Mingote; Alfonso Ortega; Antonio Miguel; Eduardo Lleida

arXiv:2409.05659·cs.SD·September 10, 2024

Open Access

TL;DR

This paper reviews current audio-visual speaker diarization databases and methods, introduces a robust framework adaptable to various data domains (TV scenarios, meetings, and daily activities), and proposes a new approach for precise celebrity identification in TV scenarios.

Contribution

The paper develops a flexible audio-visual diarization framework and introduces a method for celebrity identification in TV content, with the aim of pushing beyond current state-of-the-art practices.

Findings

01. Comprehensive compilation of existing databases and approaches.

02. Proposed framework shows improved adaptability across domains.

03. New celebrity identification method enhances accuracy in TV scenarios.

Abstract

The large amount of audio-visual content now available has created a need for new, robust automatic speaker diarization systems to analyse and characterise it. Such systems reduce the cost of doing this process manually and make speaker information available to different applications, since the content carries a large quantity of information, for example images of faces or audio recordings. This paper therefore addresses a critical area in the field of speaker diarization: the integration of audio-visual content from different domains. It seeks to push beyond current state-of-the-art practices by developing a robust audio-visual speaker diarization framework adaptable to various data domains, including TV scenarios, meetings, and daily activities. Unlike most of the existing audio-visual speaker diarization systems, this framework will…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing