The USTC-NERCSLIP Systems for The ICMC-ASR Challenge

Minghui Wu; Luzhen Xu; Jie Zhang; Haitao Tang; Yanyan Yue; Ruizhi Liao; Jintao Zhao; Zhengzhe Zhang; Yichi Wang; Haoyin Yan; Hongliang Yu; Tongle Ma; Jiachen Liu; Chongliang Wu; Yongchao Li; Yanyong Zhang; Xin Fang; Yue Zhang

arXiv:2407.02052 · eess.AS · July 3, 2024

Open Access

TL;DR

This paper presents a multi-channel speech recognition system for in-car scenarios with overlapping speakers and Mandarin accents, utilizing self-supervised embeddings, beamforming, iterative pseudo-labeling, and an accent-aware framework, achieving top performance in the ICMC-ASR challenge.

Contribution

The system introduces a novel combination of self-supervised multi-speaker embeddings, iterative pseudo-labeling, and an accent-aware ASR framework for challenging in-car multi-speaker Mandarin recognition.

Findings

01 Achieved 13.16% CER on track 1

02 Achieved 21.48% cpCER on track 2

03 Outperformed baseline and ranked first in the challenge

Abstract

This report describes the system submitted to the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) challenge, which addresses the ASR task with overlapping multi-speaker speech and Mandarin accent dynamics in the ICMC case. In the front end, we implement speaker diarization using a multi-speaker embedding based on self-supervised learning representations, and beamforming using the speaker position. For ASR, we employ an iterative pseudo-label generation method based on a fusion model to obtain text labels for unsupervised data. To mitigate the impact of accent, we propose an Accent-ASR framework that captures pronunciation-related accent features at a fine-grained level and linguistic information at a coarse-grained level. On the ICMC-ASR eval set, the proposed system achieves a CER of 13.16% on track 1 and a cpCER of 21.48% on track 2, which significantly outperforms the…
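For readers unfamiliar with the two metrics quoted above, the following is a minimal Python sketch of how a character error rate (CER) and a concatenated minimum-permutation CER (cpCER) are commonly computed. It is not the challenge's official scoring tool. The function names `edit_distance`, `cer`, and `cp_cer` are hypothetical, and the sketch assumes that cpCER follows the usual definition: each speaker's utterances are concatenated, and the assignment of hypothesis speakers to reference speakers that minimizes the total edit distance is used. It also assumes the same number of reference and hypothesis speakers.

```python
from itertools import permutations


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """CER = edit distance divided by the number of reference characters."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def cp_cer(refs: list[str], hyps: list[str]) -> float:
    """Assumed cpCER: refs and hyps hold one concatenated transcript per speaker.

    Every speaker permutation is tried, which is only practical for the
    handful of speakers found in a single car.
    """
    if len(refs) != len(hyps):
        raise ValueError("this sketch assumes equal speaker counts")
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("references must be non-empty")
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / total_ref


if __name__ == "__main__":
    # One substituted character out of six reference characters.
    print(f"CER:   {cer('今天天气很好', '今天天汽很好'):.4f}")
    # Hypothesis speakers arrive in a different order than the references.
    print(f"cpCER: {cp_cer(['abc', 'de'], ['dx', 'abc']):.4f}")
```

Because cpCER searches over speaker assignments, it penalizes recognition errors and errors in attributing speech to the wrong speaker, while ignoring the arbitrary order of the diarization output's speaker labels.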

Taxonomy

Topics: Speech Recognition and Synthesis