M3SD: Multi-modal, Multi-scenario and Multi-language Speaker Diarization Dataset

Shilong Wu

arXiv:2506.14427·eess.AS·July 1, 2025

TL;DR

This paper introduces M3SD, a diverse multi-modal, multi-scenario, multi-language speaker diarization dataset built from real online videos with an automated pseudo-labeling method that combines audio and video.

Contribution

The paper presents an automated dataset construction method for speaker diarization and uses it to build the large, diverse M3SD dataset, which is intended to improve model generalization.

Findings

01. The M3SD dataset covers multiple languages and scenarios.

02. Combining audio and video in the automated pipeline yields more accurate pseudo-labels.

03. The open-sourced dataset and code support further research in speaker diarization.

Abstract

Progress in speaker diarization is constrained by two problems: insufficient data resources and the poor generalization ability of deep learning models. To address both, we propose an automated method for constructing speaker diarization datasets, which generates more accurate pseudo-labels for massive amounts of data by combining audio and video. Using this method, we release the Multi-modal, Multi-scenario and Multi-language Speaker Diarization (M3SD) dataset. The dataset is derived from real online videos and is highly diverse. Our dataset and code are open-sourced at https://huggingface.co/spaces/OldDragon/m3sd.
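The abstract does not spell out how the audio and video streams are combined, so the following is only an illustrative sketch of one plausible fusion rule, not the authors' pipeline. It assumes each modality has already produced speaker segments with a shared set of speaker IDs, and keeps only the time regions where both agree. The names `Segment`, `fuse_pseudo_labels`, and `min_dur` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical segment type: (start_sec, end_sec, speaker_id).
Segment = Tuple[float, float, str]


def fuse_pseudo_labels(
    audio: List[Segment],
    video: List[Segment],
    min_dur: float = 0.0,
) -> List[Segment]:
    """Keep regions where audio-based and video-based labels agree.

    Assumption (not from the paper): speaker IDs are already aligned
    across the two modalities, and agreement is the acceptance rule.
    """
    fused: List[Segment] = []
    for a_start, a_end, a_spk in audio:
        for v_start, v_end, v_spk in video:
            if a_spk != v_spk:
                continue  # modalities disagree on the speaker
            start = max(a_start, v_start)
            end = min(a_end, v_end)
            if end - start > min_dur:  # non-empty overlap only
                fused.append((start, end, a_spk))
    return sorted(fused)


audio_segments = [(0.0, 5.0, "spk1"), (5.0, 9.0, "spk2")]
video_segments = [(1.0, 6.0, "spk1"), (6.0, 8.0, "spk2")]
fused_segments = fuse_pseudo_labels(audio_segments, video_segments)
```

An intersection rule like this trades coverage for precision: regions where only one modality detects a speaker are dropped rather than guessed, which is one simple way a second modality could make pseudo-labels more reliable.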

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques

Methods: Adapter