Domain Adaptation of the Pyannote Diarization Pipeline for Conversational Indonesian Audio

Muhammad Daffa'i Rafi Prasetyo; Ramadhan Andika Putra; Zaidan Naufal Ilmi; Kurniawati Azizah

arXiv:2601.03684 · cs.SD · January 8, 2026

TL;DR

This paper adapts a speaker diarization pipeline to conversational Indonesian audio using synthetic training data generated with neural TTS, reducing DER from 53.47% for the zero-shot baseline to 29.24%.

Contribution

It introduces a novel approach to adapt an English-centric diarization pipeline to Indonesian using synthetic speech data, demonstrating substantial performance gains.

Findings

01

Zero-shot baseline DER of 53.47% on Indonesian

02

Adaptation on the small synthetic dataset (171 samples) reduces DER to 34.31% (1 epoch) and 34.81% (2 epochs)

03

The model trained on the 25-hour dataset achieves the best DER of 29.24%, a 13.68% improvement
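
The gap between the zero-shot baseline and the best adapted model can be worked out directly from the two DER values reported on this page. The short sketch below computes the absolute reduction (in percentage points) and the relative reduction; the variable names are illustrative. It does not attempt to reproduce the 13.68% figure quoted above, since this page does not state how that number was derived.

```python
# DER values (in percent) as reported on this page.
baseline_der = 53.47   # zero-shot pyannote/segmentation-3.0 on Indonesian
adapted_der = 29.24    # model trained on the 25-hour synthetic dataset

# Absolute reduction, measured in percentage points.
absolute_reduction = baseline_der - adapted_der

# Relative reduction, as a share of the baseline error.
relative_reduction = absolute_reduction / baseline_der * 100

print(f"Absolute reduction: {absolute_reduction:.2f} percentage points")  # 24.23
print(f"Relative reduction: {relative_reduction:.1f}%")                   # 45.3%
```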

Abstract

This study presents a domain adaptation approach for speaker diarization targeting conversational Indonesian audio. We address the challenge of adapting an English-centric diarization pipeline to a low-resource language by employing synthetic data generation using neural Text-to-Speech technology. Experiments were conducted with two training configurations: a small dataset (171 samples) and a large dataset containing 25 hours of synthetic speech. Results demonstrate that the baseline pyannote/segmentation-3.0 model, trained on the AMI Corpus, achieves a Diarization Error Rate (DER) of 53.47% when applied zero-shot to Indonesian. Domain adaptation significantly improves performance, with the small dataset models reducing DER to 34.31% (1 epoch) and 34.81% (2 epochs). The model trained on the 25-hour dataset achieves the best performance with a DER of 29.24%, representing…
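
For readers unfamiliar with the metric, DER is conventionally defined as the sum of false-alarm, missed-detection, and speaker-confusion durations divided by the total duration of reference speech. The sketch below is a minimal illustration of that standard definition, not the paper's evaluation code: the function name and the example durations are hypothetical, and details such as forgiveness collars or overlap handling, which this page does not specify, are left out.

```python
def diarization_error_rate(false_alarm, missed_detection, confusion, total_speech):
    """Standard DER: total error duration divided by reference speech duration.

    All arguments are durations in seconds. The result is a fraction,
    which can exceed 1.0 when there is a lot of false-alarm speech.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed_detection + confusion) / total_speech


# Hypothetical durations (seconds), chosen only to illustrate the formula.
der = diarization_error_rate(
    false_alarm=12.0,       # non-speech labelled as speech
    missed_detection=30.0,  # speech that was not detected
    confusion=18.0,         # speech attributed to the wrong speaker
    total_speech=600.0,     # total reference speech
)
print(f"DER = {der:.2%}")  # 60 / 600 -> DER = 10.00%
```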

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Music and Audio Processing