Speaker Diarization of Scripted Audiovisual Content
Yogesh Virkar, Brian Thompson, Rohit Paturi, Sundararajan Srinivasan, Marcello Federico

TL;DR
This paper introduces a semi-supervised speaker diarization method that uses production scripts to improve accuracy on TV show audio, targeting two weaknesses of existing models: tracking a large number of speakers and detecting frequent speaker changes.
Contribution
It presents a novel semi-supervised approach leveraging production scripts to enhance speaker diarization accuracy in TV shows.
Findings
Achieved a 51.7% relative improvement over baseline models
Effectively utilized production scripts for pseudo-labeling
Demonstrated significant gains on a 66-show test set
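
For readers unfamiliar with the term, a "relative improvement" is the change in a metric expressed as a fraction of the baseline value. The sketch below shows that arithmetic only. The function name, the `lower_is_better` flag, and the example numbers are illustrative assumptions; the summary above does not state which metric the 51.7% figure refers to.

```python
def relative_improvement(baseline: float, system: float,
                         lower_is_better: bool = True) -> float:
    """Return the improvement of `system` over `baseline`, in percent of baseline.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    # For error-style metrics an improvement is a decrease;
    # for accuracy-style metrics it is an increase.
    delta = baseline - system if lower_is_better else system - baseline
    return 100.0 * delta / baseline


# Made-up numbers, NOT results from the paper: an error-style metric
# dropping from 20.0 to 10.0 is a 50% relative improvement.
print(relative_improvement(20.0, 10.0))
```

A relative figure says nothing about absolute values on its own, so it is best read alongside the baseline scores reported in the paper itself.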
Abstract
The media localization industry usually requires a verbatim script of the final film or TV production in order to create subtitles or dubbing scripts in a foreign language. In particular, the verbatim script (i.e., the as-broadcast script) must be structured into a sequence of dialogue lines, each including time codes, speaker name, and transcript. Current speech recognition technology alleviates the transcription step. However, state-of-the-art speaker diarization models still fall short on TV shows for two main reasons: (i) their inability to track a large number of speakers, and (ii) their low accuracy in detecting frequent speaker changes. To mitigate this problem, we present a novel approach that leverages production scripts used during the shooting process to extract pseudo-labeled data for the speaker diarization task. We propose a novel semi-supervised approach and demonstrate improvements…
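
The as-broadcast script structure described in the abstract (a sequence of dialogue lines, each with time codes, a speaker name, and a transcript) can be sketched as a small data structure. This is a minimal illustration, not the authors' format: the class name, field names, helper functions, and sample lines are all hypothetical. The two helpers count the quantities the abstract identifies as hard for diarization models, namely the number of speakers and the number of speaker changes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DialogueLine:
    """One line of an as-broadcast script (field names are assumptions)."""
    start_sec: float   # time code where the line begins, in seconds
    end_sec: float     # time code where the line ends, in seconds
    speaker: str       # speaker (character) name
    transcript: str    # verbatim text of the line

    def __post_init__(self) -> None:
        if self.end_sec < self.start_sec:
            raise ValueError("end_sec must not precede start_sec")


def count_speaker_changes(lines: List[DialogueLine]) -> int:
    """Count adjacent line pairs (ordered by start time) with different speakers."""
    ordered = sorted(lines, key=lambda line: line.start_sec)
    return sum(
        1 for prev, cur in zip(ordered, ordered[1:])
        if prev.speaker != cur.speaker
    )


def count_distinct_speakers(lines: List[DialogueLine]) -> int:
    """Count the unique speaker names appearing in the script."""
    return len({line.speaker for line in lines})


# Made-up sample lines, not taken from any real script.
script = [
    DialogueLine(0.0, 1.8, "ALICE", "Did you see that?"),
    DialogueLine(1.9, 2.6, "BOB", "See what?"),
    DialogueLine(2.7, 4.0, "BOB", "Oh. That."),
    DialogueLine(4.1, 5.5, "ALICE", "Exactly."),
]
print(count_speaker_changes(script), count_distinct_speakers(script))
```

Keeping each line immutable and validating the time codes on construction makes it harder for a malformed entry to slip into downstream subtitle or dubbing tooling unnoticed.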
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
