Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets
Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, Mingchen Shao, Lei Xie

TL;DR
This paper evaluates the integration of Large Language Models with automatic speech recognition on Chinese datasets, proposing a three-stage training method that achieves state-of-the-art results and provides insights for future research.
Contribution
It introduces a three-stage training approach for LLM-based ASR on Chinese data and demonstrates SOTA performance on multiple benchmarks.
Findings
Achieved state-of-the-art results on the AISHELL-1, Test_Net, and Test_Meeting test sets.
Analyzed the impact of different speech encoder, LLM, and projector configurations.
Provided open-source scripts, models, and logs for reproducibility.
Abstract
Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. The implementation of this approach, alongside the strategic integration of ASR components, enabled us to achieve SOTA performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis presents an…
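As a rough illustration of the encoder-projector-LLM layout the abstract refers to, the sketch below shows how a projector can turn speech-encoder frames into vectors of the LLM's embedding size. It is not the paper's implementation: the frame-stacking-plus-linear projector, the toy dimensions, and every name here (stack_frames, project, ENC_DIM, LLM_DIM, K) are assumptions chosen for illustration, and the encoder output and prompt embeddings are dummy values.

```python
import random

def stack_frames(frames, k):
    """Concatenate every k consecutive frames, dropping an incomplete tail."""
    usable = len(frames) - len(frames) % k
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]

def linear(x, weight, bias):
    """Apply y = W x + b to one vector (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def init_linear(in_dim, out_dim, seed=0):
    """Random small weights and zero bias; stands in for trained parameters."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim

def project(frames, weight, bias, k):
    """Hypothetical projector: downsample by stacking, then map to LLM width."""
    return [linear(f, weight, bias) for f in stack_frames(frames, k)]

# Toy sizes (assumptions): encoder width 4, LLM embedding width 6, stack 2 frames.
ENC_DIM, LLM_DIM, K = 4, 6, 2

# Dummy stand-in for speech-encoder output: 5 frames of width ENC_DIM.
encoder_out = [[float(t)] * ENC_DIM for t in range(5)]

weight, bias = init_linear(ENC_DIM * K, LLM_DIM)
speech_embeds = project(encoder_out, weight, bias, K)

# Dummy stand-in for embedded text-prompt tokens, already at LLM width.
prompt_embeds = [[0.0] * LLM_DIM for _ in range(3)]

# The LLM would consume the projected speech and prompt vectors as one sequence.
llm_input = speech_embeds + prompt_embeds
print(len(speech_embeds), len(llm_input), len(llm_input[0]))
```

Stacking frames before the linear map shortens the speech sequence by a factor of K, which is one common way to bring frame-rate features closer to token-rate inputs; other projector designs are possible, and the ones compared in the paper are not reproduced here.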
Taxonomy
Topics: Natural Language Processing Techniques
Methods: ALIGN
