Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition

Jinming Chen; Jingyi Fang; Yuanzhong Zheng; Yaoxuan Wang; Haojun Fei

arXiv:2407.03026 · cs.SD · July 4, 2024

TL;DR

Qifusion-Net is a novel end-to-end multi-accent speech recognition model that adaptively fuses features without prior accent knowledge, improving accuracy across diverse accents.

Contribution

The paper introduces a layer-adapted fusion (LAF) model that uses a dynamic chunk strategy to support streaming multi-accent recognition without prior accent information.

Findings

01

Achieved relative character error rate (CER) reductions of 22.1% and 17.2% over the baseline on the multi-accent test sets of KeSpeech and MagicData-RMAC.

02

Outperformed baseline models in multi-accent speech recognition tasks.

03

Extracted frame-level acoustic features, enabling fine-grained information fusion.
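
The CER figures in the findings above are relative reductions, which are easy to confuse with absolute ones. The following is a minimal sketch of how CER and a relative reduction are commonly computed. The function names and the example numbers are illustrative assumptions and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer, model_cer):
    """Relative CER reduction of a model with respect to a baseline."""
    return (baseline_cer - model_cer) / baseline_cer


# Hypothetical values, for illustration only (not results from the paper):
# a baseline CER of 10% and a model CER of 8% give a 20% relative reduction.
example = relative_reduction(0.10, 0.08)
```

Under this definition, a 2-point absolute drop from a 10% CER counts as a 20% relative reduction, so the relative figure depends on how strong the baseline is.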

Abstract

End-to-end (E2E) speech recognition methods have achieved promising performance. However, automatic speech recognition (ASR) models still face challenges in recognizing multi-accent speech accurately. We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent. Based on a dynamic chunk strategy, our approach enables streaming decoding and can extract frame-level acoustic features, facilitating fine-grained information fusion. Experimental results demonstrate that our proposed method outperforms the baseline, with relative reductions of 22.1% and 17.2% in character error rate (CER) on the multi-accent test sets of KeSpeech and MagicData-RMAC.
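
The abstract does not spell out the dynamic chunk strategy. As a rough illustration, the sketch below follows the chunk-based attention masking commonly used in unified streaming/non-streaming ASR encoders: each frame attends to its own chunk and to earlier chunks, and the chunk size is varied during training. The function names, the sampling policy, and the default `max_chunk` are assumptions for illustration and may differ from what Qifusion-Net actually does.

```python
import random


def chunk_attention_mask(num_frames, chunk_size, num_left_chunks=-1):
    """Boolean mask where mask[i][j] is True if frame i may attend to frame j.

    A frame sees every frame in its own chunk plus earlier chunks.
    num_left_chunks < 0 means unlimited left context.
    """
    mask = []
    for i in range(num_frames):
        chunk_idx = i // chunk_size
        end = min((chunk_idx + 1) * chunk_size, num_frames)
        if num_left_chunks < 0:
            start = 0
        else:
            start = max((chunk_idx - num_left_chunks) * chunk_size, 0)
        mask.append([start <= j < end for j in range(num_frames)])
    return mask


def sample_chunk_size(num_frames, max_chunk=25, rng=random):
    """Assumed training-time policy: full context half of the time,
    otherwise a random chunk size between 1 and max_chunk frames."""
    if rng.random() < 0.5:
        return num_frames  # non-streaming case: one chunk covers the utterance
    return rng.randint(1, max_chunk)


# Streaming-style mask for 4 frames with 2-frame chunks.
streaming_mask = chunk_attention_mask(4, 2)
# Setting chunk_size to the utterance length recovers full attention.
full_mask = chunk_attention_mask(4, 4)
```

Training with varying chunk sizes lets one set of weights serve both modes: at inference, a small fixed chunk gives low-latency streaming decoding, while a chunk spanning the whole utterance gives non-streaming decoding.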

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing