Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization
Ambre Marie (LaTIM), Thomas Bertin (DySoLab), Guillaume Dardenne (LaTIM), Gwenolé Quellec (LaTIM)

TL;DR
This paper introduces an iterative multi-pass LLM post-processing approach for French clinical speech transcription and speaker diarization. It reports statistically significant WDER reductions on suicide prevention calls and stable performance on awake neurosurgery consultations.
Contribution
The paper presents a multi-pass LLM post-processing architecture for clinical speech transcription, with ablation studies over four design choices: model selection, prompting strategy, pass ordering, and iteration depth.
Findings
Significant WDER reduction on suicide prevention conversations (Wilcoxon signed-rank test, p<0.05, n=18)
Stable performance on awake neurosurgery consultations (n=10)
Zero output failures with acceptable computational cost (RTF 0.32)
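To make the computational-cost figure concrete: the abstract reports a real-time factor (RTF) of 0.32. The sketch below uses the usual definition, processing time divided by audio duration, and is only illustrative. The function names are hypothetical, and the summary does not say which pipeline stages the reported RTF covers.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF under the usual definition: processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def estimated_processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Invert the definition: expected processing time at a given RTF."""
    return audio_seconds * rtf


# Assumption for illustration: a one-hour recording processed at RTF 0.32.
one_hour = 3600.0
estimated = estimated_processing_seconds(one_hour, 0.32)
print(f"{estimated / 60:.1f} minutes")  # about 19.2 minutes
```

An RTF below 1 means processing finishes faster than the recording plays back.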
Abstract
Automatic speech recognition for French medical conversations remains challenging, with word error rates often exceeding 30% in spontaneous clinical speech. This study proposes a multi-pass LLM post-processing architecture alternating between Speaker Recognition and Word Recognition passes to improve transcription accuracy and speaker attribution. Ablation studies on two French clinical datasets (suicide prevention telephone counseling and preoperative awake neurosurgery consultations) investigate four design choices: model selection, prompting strategy, pass ordering, and iteration depth. Using Qwen3-Next-80B, Wilcoxon signed-rank tests confirm significant WDER reductions on suicide prevention conversations (p<0.05, n=18), while maintaining stability on awake neurosurgery consultations (n=10), with zero output failures and acceptable computational cost (RTF 0.32), suggesting…
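The alternation between Speaker Recognition and Word Recognition passes described in the abstract can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the names `Turn`, `run_multipass`, `toy_speaker_pass`, and `toy_word_pass` are hypothetical, the toy passes stand in for real LLM calls, and the fallback to the previous transcript on empty output is an assumption rather than something the abstract describes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Turn:
    speaker: str
    text: str


# A pass takes a transcript and returns a (possibly corrected) transcript.
Pass = Callable[[List[Turn]], List[Turn]]


def run_multipass(turns: List[Turn], speaker_pass: Pass, word_pass: Pass,
                  iterations: int = 2, speaker_first: bool = True) -> List[Turn]:
    """Alternate the two passes; `iterations` and `speaker_first` mirror the
    iteration-depth and pass-ordering design choices named in the abstract."""
    order = (speaker_pass, word_pass) if speaker_first else (word_pass, speaker_pass)
    current = list(turns)
    for _ in range(iterations):
        for llm_pass in order:
            candidate = llm_pass(current)
            # Assumed safeguard: keep the previous transcript if a pass
            # returns nothing usable.
            if candidate:
                current = candidate
    return current


def toy_speaker_pass(turns: List[Turn]) -> List[Turn]:
    # Stand-in for an LLM call: attribute questions to the clinician.
    return [Turn("clinician" if t.text.endswith("?") else t.speaker, t.text)
            for t in turns]


def toy_word_pass(turns: List[Turn]) -> List[Turn]:
    # Stand-in for an LLM call: fix a known mis-transcribed token.
    fixes = {"doliprane": "Doliprane"}
    return [Turn(t.speaker, " ".join(fixes.get(w, w) for w in t.text.split()))
            for t in turns]


draft = [Turn("spk0", "vous prenez du doliprane ?"),
         Turn("spk1", "oui tous les jours")]
result = run_multipass(draft, toy_speaker_pass, toy_word_pass)
for turn in result:
    print(f"{turn.speaker}: {turn.text}")
```

In a real system each pass would wrap a prompt to the chosen model (the abstract names Qwen3-Next-80B), and the ablations would vary the prompt, the order of the passes, and the number of iterations.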
Taxonomy
Topics: Voice and Speech Disorders · Speech Recognition and Synthesis · Emotion and Mood Recognition
