AISHELL-5: The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition

Yuhang Dai; He Wang; Xingchen Li; Zihan Zhang; Shuiyuan Wang; Lei Xie; Xin Xu; Hongxiao Guo; Shaoji Zhang; Hui Bu; Wei Chen

arXiv:2505.23036·cs.SD·May 30, 2025

TL;DR

AISHELL-5 is the first open-source multi-channel in-car Mandarin speech dataset, enabling research on automatic speech recognition and diarization in complex driving environments with real-world noise and multi-speaker scenarios.

Contribution

This paper introduces AISHELL-5, the first comprehensive open-source in-car multi-channel Mandarin speech dataset with real driving scenarios and noise, along with a baseline system for speech separation and recognition.

Findings

01. Mainstream ASR models face challenges on AISHELL-5 data.

02. The dataset provides a new benchmark for in-car speech recognition.

03. Baseline system demonstrates the feasibility of multi-speaker in-car ASR.

Abstract

This paper delineates AISHELL-5, the first open-source in-car multi-channel multi-speaker Mandarin automatic speech recognition (ASR) dataset. AISHELL-5 includes two parts: (1) over 100 hours of multi-channel speech data recorded in an electric vehicle across more than 60 real driving scenarios, consisting of four far-field speech signals captured by microphones located on each car door, as well as near-field signals obtained from high-fidelity headset microphones worn by each speaker; and (2) a collection of 40 hours of real-world environmental noise recordings, which supports the in-car speech data simulation. We also provide an open-access, reproducible baseline system based on this dataset. This system features a speech frontend model that employs speech source separation to extract each speaker's clean speech from the far-field signals, along with a speech…
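The abstract notes that the noise recordings support in-car speech data simulation. One common way such simulation is done is to add a noise recording to clean speech at a chosen signal-to-noise ratio (SNR). The sketch below illustrates that generic technique only; it is not taken from the paper or its repository. The function names, the 16 kHz sample rate, and the synthetic tone and Gaussian noise standing in for real recordings are all assumptions made for illustration.

```python
import math
import random


def rms(signal):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB.

    Hypothetical helper: both inputs are equal-length lists of float samples
    for a single channel.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_rms = rms(speech)
    noise_rms = rms(noise)
    if speech_rms == 0.0 or noise_rms == 0.0:
        raise ValueError("speech and noise must both be non-silent")
    # SNR_dB = 20 * log10(speech_rms / (gain * noise_rms)), solved for gain.
    gain = speech_rms / (noise_rms * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy stand-ins for real recordings (assumed 16 kHz, one second each):
# a 440 Hz tone plays the role of speech, Gaussian samples the role of noise.
rng = random.Random(0)
sample_rate = 16000
speech = [0.5 * math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
noise = [rng.gauss(0.0, 0.1) for _ in range(sample_rate)]

mixture = mix_at_snr(speech, noise, snr_db=5.0)

# Check the result: the part of the mixture that is not speech is the scaled noise.
residual = [m - s for m, s in zip(mixture, speech)]
achieved_snr_db = 20 * math.log10(rms(speech) / rms(residual))
print(f"achieved SNR: {achieved_snr_db:.2f} dB")
```

With real data, the same scaling would be applied per channel after loading the audio, and a realistic pipeline would likely also account for room acoustics and speaker position, which this sketch ignores.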

Code & Models

Repositories

daiyvhang/aishell-5 (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Vehicle Noise and Vibration Control