M3SD: Multi-modal, Multi-scenario and Multi-language Speaker Diarization Dataset
Shilong Wu

TL;DR
This paper introduces M3SD, a diverse multi-modal, multi-scenario, multi-language speaker diarization dataset created using an automated pseudo-labeling method that combines audio and video data from real network videos.
Contribution
The paper presents a novel automated dataset construction method for speaker diarization and uses it to build the large-scale, diverse M3SD dataset, which is intended to improve model generalization.
Findings
M3SD dataset covers multiple languages and scenarios.
Combining audio and video in automated pseudo-labeling yields more accurate labels for large-scale data.
Open-sourced dataset facilitates research in speaker diarization.
Abstract
In speaker diarization, progress is constrained by two problems: insufficient data resources and the poor generalization ability of deep learning models. To address both, we propose an automated method for constructing speaker diarization datasets, which combines audio and video to generate more accurate pseudo-labels for massive data. Using this method, we release the Multi-modal, Multi-scenario and Multi-language Speaker Diarization (M3SD) dataset. The dataset is derived from real network videos and is highly diverse. Our dataset and code are open-sourced at https://huggingface.co/spaces/OldDragon/m3sd.
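The abstract does not spell out how the audio and video cues are combined, so the following is only a minimal sketch of one plausible fusion rule, not the authors' pipeline. It assumes that an audio diarizer and a video-based active-speaker detector have each already produced speaker-labeled time segments, that speaker identities have been matched across the two modalities, and that the video segments of any one speaker do not overlap each other. The names `Segment`, `fuse_pseudo_labels`, and `min_agreement` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A speaker-labeled time interval in seconds (hypothetical structure)."""
    speaker: str
    start: float
    end: float


def overlap(a: Segment, b: Segment) -> float:
    """Length in seconds of the time overlap between two segments."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def fuse_pseudo_labels(audio_segments, video_segments, min_agreement=0.5):
    """Keep an audio segment only if video evidence for the same speaker
    covers at least `min_agreement` of its duration.

    Assumption: this agreement filter is an illustrative rule; the paper's
    actual audio-visual fusion may differ.
    """
    kept = []
    for a in audio_segments:
        duration = a.end - a.start
        if duration <= 0:
            continue  # skip empty or malformed segments
        covered = sum(
            overlap(a, v) for v in video_segments if v.speaker == a.speaker
        )
        if covered / duration >= min_agreement:
            kept.append(a)
    return kept
```

With such a filter, an audio segment attributed to a speaker who is not visibly talking for most of its duration would be dropped from the pseudo-labels, trading some coverage for label accuracy.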
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
Methods: Adapter
