Prompting Whisper for Joint Speech Transcription and Diarization
Mariia Zamyrova, Henk van den Heuvel

TL;DR
This in-progress research explores adapting Whisper for real-time transcription and diarization of Dutch speech, using speaker-labelled prompts and fine-tuning to improve speaker labeling and transcription accuracy.
Contribution
It shows that prompting Whisper with speaker-labelled text, and fine-tuning it on such prompts, can improve speaker diarization and transcription, and it identifies remaining challenges with overlapping speech and timestamp accuracy.
Findings
Prompting Whisper with speaker labels yields promising diarization accuracy.
Fine-tuning Whisper improves speaker ID consistency and transcription quality.
Challenges remain with overlapping speech and timestamp errors affecting diarization.
Abstract
As part of the MediSpeech project, we aim to develop a system that transcribes and diarizes Dutch conversations between doctors and patients in real time. In this in-progress research, we explore ways of efficiently combining Whisper with speaker diarization (SD). When prompted with text that contains speaker labels, Whisper was able to insert labels into the transcription with promising accuracy. We continued this line of research by fine-tuning Whisper with speaker-labelled prompts to generate transcriptions in a format similar to that of Serialized Output Training (SOT). Fine-tuning Whisper yielded more consistent speaker IDs across the chunks of long-form audio and improved verbatim transcription. The study also uncovered new challenges: Whisper's SD performance suffers from mistakes that get propagated through prompts and from inaccurate timestamps…
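The idea of carrying speaker labels from one audio chunk to the next through the prompt can be illustrated with a small sketch. This is not the authors' implementation: the `[S1]`-style label format, the function names, and the `max_turns` parameter are all assumptions made for illustration, and the step that actually passes the prompt to Whisper is left out.

```python
import re
from typing import List, Tuple

Turn = Tuple[str, str]  # (speaker label, text)

# Hypothetical label format; the source does not state which one it uses.
LABEL_RE = re.compile(r"\[(S\d+)\]")


def serialize_turns(turns: List[Turn]) -> str:
    """Join turns into one labelled string, e.g. '[S1] hello [S2] hi'."""
    return " ".join(f"[{spk}] {text.strip()}" for spk, text in turns)


def parse_turns(labelled: str) -> List[Turn]:
    """Split a labelled string back into (speaker, text) turns."""
    # With a capturing group, re.split returns
    # [prefix, speaker1, text1, speaker2, text2, ...].
    parts = LABEL_RE.split(labelled)
    turns = []
    for i in range(1, len(parts) - 1, 2):
        text = parts[i + 1].strip()
        if text:
            turns.append((parts[i], text))
    return turns


def next_chunk_prompt(previous_output: str, max_turns: int = 2) -> str:
    """Keep the last few labelled turns of the previous chunk as the
    prompt text for the next chunk (assumed strategy, for illustration)."""
    return serialize_turns(parse_turns(previous_output)[-max_turns:])


if __name__ == "__main__":
    previous = "[S1] Goedemorgen. [S2] Goedemorgen dokter. [S1] Hoe gaat het?"
    print(next_chunk_prompt(previous))
```

Because the prompt for each chunk is built from the model's own earlier output, a wrongly labelled turn in one chunk is fed forward into the next, which is one way the error propagation mentioned in the abstract could arise.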
