Speaker Diarization of Scripted Audiovisual Content

Yogesh Virkar; Brian Thompson; Rohit Paturi; Sundararajan Srinivasan; Marcello Federico

arXiv:2308.02160 · cs.CL · August 7, 2023 · 1 citation

Open Access

TL;DR

This paper introduces a semi-supervised speaker diarization method that uses production scripts to improve accuracy in TV show audio, addressing challenges of multiple speakers and frequent changes.

Contribution

It presents a novel semi-supervised approach leveraging production scripts to enhance speaker diarization accuracy in TV shows.

Findings

01 Achieved 51.7% relative improvement over baseline models

02 Effectively utilized production scripts for pseudo-labeling

03 Demonstrated significant gains on a 66-show test set

Abstract

The media localization industry usually requires a verbatim script of the final film or TV production in order to create subtitles or dubbing scripts in a foreign language. In particular, the verbatim script (i.e. as-broadcast script) must be structured into a sequence of dialogue lines each including time codes, speaker name and transcript. Current speech recognition technology alleviates the transcription step. However, state-of-the-art speaker diarization models still fall short on TV shows for two main reasons: (i) their inability to track a large number of speakers, (ii) their low accuracy in detecting frequent speaker changes. To mitigate this problem, we present a novel approach to leverage production scripts used during the shooting process, to extract pseudo-labeled data for the speaker diarization task. We propose a novel semi-supervised approach and demonstrate improvements…
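
As a rough illustration of the as-broadcast script structure described in the abstract (a sequence of dialogue lines, each with time codes, a speaker name, and a transcript), the following is a minimal sketch. The class name `DialogueLine`, its field names, the helper `count_speaker_changes`, and the sample lines are all hypothetical and made up for illustration; none of them come from the paper.

```python
from dataclasses import dataclass


@dataclass
class DialogueLine:
    """One line of an as-broadcast script (field names are hypothetical)."""
    start: float      # start time code, in seconds
    end: float        # end time code, in seconds
    speaker: str      # speaker (character) name
    transcript: str   # verbatim text of the line


def count_speaker_changes(lines):
    """Count positions where consecutive lines have different speakers."""
    return sum(1 for a, b in zip(lines, lines[1:]) if a.speaker != b.speaker)


# Invented sample data, for illustration only.
script = [
    DialogueLine(0.0, 1.8, "ANNA", "Where were you last night?"),
    DialogueLine(1.9, 3.2, "BEN", "At the studio."),
    DialogueLine(3.3, 4.0, "BEN", "All night."),
    DialogueLine(4.1, 5.5, "ANNA", "I called twice."),
]

num_speakers = len({line.speaker for line in script})
num_changes = count_speaker_changes(script)
```

Counting distinct speakers and speaker changes over such a structure gives a simple view of the two difficulties the abstract names: a large number of speakers and frequent speaker changes.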

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques