AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario

Yihui Fu; Luyao Cheng; Shubo Lv; Yukai Jv; Yuxiang Kong; Zhuo Chen; Yanxin Hu; Lei Xie; Jian Wu; Hui Bu; Xin Xu; Jun Du; Jingdong Chen

arXiv:2104.03603·cs.SD·August 11, 2021

4 Repos 7 Models 1 Datasets

TL;DR

AISHELL-4 is a comprehensive open-source Mandarin speech dataset recorded in conference scenarios, enabling research on multi-speaker processing, recognition, and diarization with realistic acoustics and detailed annotations.

Contribution

It introduces a large-scale, real-recorded Mandarin conference dataset with multi-task annotations and a baseline framework, filling a gap in non-English multi-speaker speech resources.

Findings

01. Provides realistic acoustics with natural speech variations.

02. Includes detailed annotations for speech and speaker activities.

03. Supports multi-task research in speech processing.

Abstract

In this paper, we present AISHELL-4, a sizable real-recorded Mandarin speech dataset collected by an 8-channel circular microphone array for speech processing in conference scenarios. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. This dataset aims to bridge the advanced research on multi-speaker processing and the practical application scenario in three aspects. With real recorded meetings, AISHELL-4 provides realistic acoustics and rich natural speech characteristics in conversation such as short pauses, speech overlap, quick speaker turns, noise, etc. Meanwhile, accurate transcription and speaker voice activity are provided for each meeting in AISHELL-4. This allows researchers to explore different aspects of meeting processing, ranging from individual tasks such as speech front-end processing, speech…
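As an illustration of how speaker voice activity annotations of the kind mentioned in the abstract might be used, the following minimal Python sketch measures how much of a meeting contains overlapped speech. The `(speaker_id, start_sec, end_sec)` tuple layout, the function name `speech_and_overlap`, and the example values are assumptions made for illustration only; they are not the dataset's actual annotation format, which would first need to be parsed into this shape.

```python
from typing import List, Tuple

# Hypothetical segment layout: (speaker_id, start_sec, end_sec).
Segment = Tuple[str, float, float]


def speech_and_overlap(segments: List[Segment]) -> Tuple[float, float]:
    """Return (speech_seconds, overlap_seconds) for one meeting.

    speech_seconds counts time where at least one speaker is active;
    overlap_seconds counts time where two or more speakers are active.
    """
    events = []
    for _speaker, start, end in segments:
        if end <= start:
            raise ValueError("segment end must be after its start")
        events.append((start, 1))   # a speaker becomes active
        events.append((end, -1))    # a speaker becomes inactive
    # Sorting puts (t, -1) before (t, +1), so back-to-back segments
    # that merely touch are not counted as overlapping.
    events.sort()

    active = 0
    previous_time = None
    speech = 0.0
    overlap = 0.0
    for time, delta in events:
        if previous_time is not None and active >= 1:
            speech += time - previous_time
            if active >= 2:
                overlap += time - previous_time
        active += delta
        previous_time = time
    return speech, overlap


# Made-up example: speaker B starts one second before speaker A finishes.
example = [("A", 0.0, 5.0), ("B", 4.0, 6.0), ("C", 10.0, 12.0)]
speech, overlap = speech_and_overlap(example)
print(f"speech={speech:.1f}s overlap={overlap:.1f}s ratio={overlap / speech:.3f}")
```

A sweep over sorted start and end events is used here because it handles any number of simultaneous speakers in one pass, which suits meetings with 4 to 8 participants.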

Code & Models

Datasets

shenyunhang/AISHELL-4
dataset · 981 downloads
