The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge

Jingguang Tian; Shuaishuai Ye; Shunfei Chen; Yang Xiang; Zhaohui Yin; Xinhui Hu; Xinkang Xu

arXiv:2405.05498·cs.SD·May 10, 2024

TL;DR

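This paper describes a system submission to the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge that combines end-to-end speaker diarization models with end-to-end ASR models trained on self-supervised learning representations. It reaches a CER of 16.93% on the track 1 evaluation set and a cpCER of 25.88% on the track 2 evaluation set.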

Contribution

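We develop end-to-end speaker diarization models that lower the diarization error rate (DER) by 49.58% compared to the official baseline on the development set, and we train end-to-end ASR models on self-supervised learning representations for speech recognition in complex multi-speaker scenarios.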

Findings

01

Diarization error rate (DER) reduced by 49.58% compared to the official baseline on the development set

02

Character error rate (CER) of 16.93% on the track 1 evaluation set

03

cpCER of 25.88% on the track 2 evaluation set

Abstract

This paper presents our system submission for the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge, which focuses on speaker diarization and speech recognition in complex multi-speaker scenarios. To address these challenges, we develop end-to-end speaker diarization models that notably decrease the diarization error rate (DER) by 49.58% compared to the official baseline on the development set. For speech recognition, we utilize self-supervised learning representations to train end-to-end ASR models. By integrating these models, we achieve a character error rate (CER) of 16.93% on the track 1 evaluation set, and a concatenated minimum permutation character error rate (cpCER) of 25.88% on the track 2 evaluation set.
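The two recognition metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the challenge's official scoring tool: the function names (`edit_distance`, `cer`, `cp_cer`) are hypothetical, text normalization is omitted, and the brute-force search over speaker permutations is only practical for a handful of speakers. It assumes the usual definitions: CER is the character-level edit distance divided by the reference length, and cpCER concatenates each speaker's utterances and takes the lowest error over all reference-to-hypothesis speaker assignments.

```python
from itertools import permutations


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def cp_cer(ref_by_speaker, hyp_by_speaker) -> float:
    """Concatenated minimum-permutation CER (illustrative, brute force).

    Each argument is a list with one entry per speaker, and each entry is
    that speaker's list of utterance strings in time order.
    """
    refs = ["".join(utts) for utts in ref_by_speaker]
    hyps = ["".join(utts) for utts in hyp_by_speaker]
    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(refs), len(hyps))
    refs += [""] * (n - len(refs))
    hyps += [""] * (n - len(hyps))
    total_ref_chars = sum(len(r) for r in refs)
    if total_ref_chars == 0:
        raise ValueError("reference must be non-empty")
    # Try every assignment of hypothesis speakers to reference speakers.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_chars
```

For example, `cp_cer([["abcd"], ["ef"]], [["eg"], ["abcd"]])` scores the hypothesis against the best speaker assignment, so swapped speaker labels alone are not penalized; only the single substituted character counts.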

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing