Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition
Jinming Chen, Jingyi Fang, Yuanzhong Zheng, Yaoxuan Wang, Haojun Fei

TL;DR
Qifusion-Net is a novel end-to-end multi-accent speech recognition model that adaptively fuses features without prior accent knowledge, improving accuracy across diverse accents.
Contribution
It introduces a layer-adapted fusion (LAF) model that uses a dynamic chunk strategy to support streaming decoding and recognizes multi-accent speech without requiring prior accent information.
Findings
Achieved relative CER reductions of 22.1% and 17.2% over the baseline on the KeSpeech and MagicData-RMAC multi-accent test sets, respectively.
Outperformed baseline models in multi-accent speech recognition tasks.
Extracted frame-level acoustic features, enabling fine-grained information fusion for improved recognition accuracy.
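For readers unfamiliar with the metric, the sketch below shows how a character error rate and a relative reduction are typically computed. It is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, and the numbers in the usage note are made up to show the arithmetic, not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, proposed):
    """Relative reduction of an error rate with respect to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - proposed) / baseline
```

With an illustrative baseline CER of 10.0% and a system CER of 7.79%, `relative_reduction(10.0, 7.79)` is about 0.221, i.e. a 22.1% relative reduction. The absolute reduction in that example would be only 2.21 percentage points, which is why the distinction between relative and absolute figures matters when reading the results.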
Abstract
End-to-end (E2E) speech recognition methods have achieved promising performance. However, automatic speech recognition (ASR) models still face challenges in recognizing multi-accent speech accurately. We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent. Based on a dynamic chunk strategy, our approach enables streaming decoding and can extract frame-level acoustic features, facilitating fine-grained information fusion. Experimental results demonstrate that our proposed method outperforms the baseline, with relative reductions of 22.1% and 17.2% in character error rate (CER) on the multi-accent test sets of KeSpeech and MagicData-RMAC.
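The abstract does not spell out the dynamic chunk strategy, so the following is only a sketch of how chunk-based attention masking is commonly set up in streaming ASR encoders, under stated assumptions: each frame may attend to its own chunk and all earlier chunks, and the chunk size is resampled during training so that one model serves both streaming and non-streaming decoding. The function names, the 50/50 sampling rule, and the `max_chunk` default are hypothetical and not taken from the paper.

```python
import random


def chunk_attention_mask(num_frames, chunk_size):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Each frame sees every frame up to the end of its own chunk,
    so no information from future chunks leaks in.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    mask = []
    for i in range(num_frames):
        visible_end = min((i // chunk_size + 1) * chunk_size, num_frames)
        mask.append([j < visible_end for j in range(num_frames)])
    return mask


def sample_chunk_size(num_frames, max_chunk=25, rng=random):
    """Pick a chunk size for one training batch (assumed schedule).

    Half the time the whole utterance is one chunk (non-streaming);
    otherwise a small random chunk size simulates streaming.
    """
    if rng.random() < 0.5:
        return num_frames
    return rng.randint(1, min(max_chunk, num_frames))
```

For example, `chunk_attention_mask(4, 2)` lets frames 0 and 1 see only frames 0 to 1, while frames 2 and 3 see all four frames. At inference time a fixed chunk size would be chosen to trade latency against accuracy; setting it to the utterance length recovers full-context decoding.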
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
