Audio-Visual Speaker Diarization: Current Databases, Approaches and Challenges
Victoria Mingote, Alfonso Ortega, Antonio Miguel, Eduardo Lleida

TL;DR
This paper reviews current audio-visual speaker diarization methods, introduces a robust framework that adapts to various domains, and proposes a new approach for precise celebrity identification in TV scenarios.
Contribution
It develops a new flexible audio-visual diarization framework and introduces a novel method for celebrity identification in TV content, advancing current state-of-the-art practices.
Findings
Comprehensive compilation of existing databases and approaches.
Proposed framework shows improved adaptability across domains.
New celebrity identification method enhances accuracy in TV scenarios.
Abstract
The large amount of audio-visual content now available has created a need for new, robust automatic speaker diarization systems to analyse and characterise it. Such systems reduce the cost of performing this process manually and make speaker information available to different applications, since the content carries a large quantity of information, for example, images of faces or audio recordings. This paper therefore addresses a critical area in the field of speaker diarization: the integration of audio-visual content from different domains. It seeks to push beyond current state-of-the-art practices by developing a robust audio-visual speaker diarization framework adaptable to various data domains, including TV scenarios, meetings, and daily activities. Unlike most of the existing audio-visual speaker diarization systems, this framework will…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
