Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

Xuelong Geng; Tianyi Xu; Kun Wei; Bingshen Mu; Hongfei Xue; He Wang; Yangze Li; Pengcheng Guo; Yuhang Dai; Longhao Li; Mingchen Shao; Lei Xie

arXiv:2405.02132 · cs.SD · November 6, 2024

TL;DR

This paper evaluates the integration of Large Language Models with automatic speech recognition on Chinese datasets, proposing a three-stage training method that achieves state-of-the-art results and provides insights for future research.

Contribution

It introduces a three-stage training approach for LLM-based ASR on Chinese data and demonstrates SOTA performance on multiple benchmarks.

Findings

01 Achieved state-of-the-art results on AISHELL-1, Test_Net, and Test_Meeting datasets.

02 Analyzed the impact of different speech encoder, LLM, and projector configurations.

03 Provided open-source scripts, models, and logs for reproducibility.

Abstract

Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. The implementation of this approach, alongside the strategic integration of ASR components, enabled us to achieve the SOTA performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis presents an…
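
As a rough illustration of the speech encoder, projector, and LLM layout the abstract refers to, the sketch below shows one common projector design: stacking adjacent encoder frames to shorten the sequence, then applying a linear map into the LLM embedding width. The function names, toy dimensions, and stacking factor are all assumptions made for illustration. The abstract does not specify the paper's projector or what its three training stages consist of, so none of this should be read as the authors' implementation.

```python
import random
from typing import List

Vector = List[float]


def stack_frames(frames: List[Vector], k: int) -> List[Vector]:
    """Concatenate every k consecutive frames; drop a trailing partial group."""
    usable = len(frames) - len(frames) % k
    return [
        [x for frame in frames[i:i + k] for x in frame]
        for i in range(0, usable, k)
    ]


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Compute y = W x + b, with weight stored as one row per output dimension."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def project(frames: List[Vector], weight: List[Vector], bias: Vector, k: int) -> List[Vector]:
    """Hypothetical projector: frame stacking followed by a linear layer."""
    return [linear(v, weight, bias) for v in stack_frames(frames, k)]


# Toy sizes chosen for readability, not taken from the paper.
ENC_DIM, LLM_DIM, K = 4, 6, 2
rng = random.Random(0)

# Stand-in for speech encoder output: 7 frames of ENC_DIM features each.
encoder_out = [[rng.uniform(-1, 1) for _ in range(ENC_DIM)] for _ in range(7)]
weight = [[rng.uniform(-0.1, 0.1) for _ in range(ENC_DIM * K)] for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM

# These vectors would be fed to the LLM alongside the text prompt embeddings.
speech_embeddings = project(encoder_out, weight, bias, K)
print(len(speech_embeddings), len(speech_embeddings[0]))  # 3 6
```

In this toy run, seven encoder frames become three LLM-width vectors: the stacking factor trades temporal resolution for a shorter input sequence, which is one reason projector configuration is a natural thing to compare.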

Code & Models

Repositories

gengxuelong/wenet_LLM_from_ASLP
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques

Methods: ALIGN