Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition

Ye Bai; Jingping Chen; Jitong Chen; Wei Chen; Zhuo Chen; Chuang Ding; Linhao Dong; Qianqian Dong; Yujiao Du; Kepan Gao; Lu Gao; Yi Guo; Minglun Han; Ting Han; Wenchao Hu; Xinying Hu; Yuxiang Hu; Deyu Hua; Lu Huang; Mingkun Huang; Youjia Huang; Jishuo Jin; Fanliu Kong; Zongwei Lan; Tianyu Li; Xiaoyang Li; Zeyang Li; Zehua Lin; Rui Liu; Shouda Liu; Lu Lu; Yizhou Lu; Jingting Ma; Shengtao Ma; Yulin Pei; Chen Shen; Tian Tan; Xiaogang Tian; Ming Tu; Bo Wang; Hao Wang; Yuping Wang; Yuxuan Wang; Hanzhang Xia; Rui Xia; Shuangyi Xie; Hongmin Xu; Meng Yang; Bihong Zhang; Jun Zhang; Wanyi Zhang; Yang Zhang; Yawei Zhang; Yijie Zheng; Ming Zou

arXiv:2407.04675 · eess.AS · July 11, 2024 · 5 cites

Open Access

TL;DR

Seed-ASR introduces an LLM-based speech recognition framework that effectively handles diverse speech signals and contextual information, outperforming traditional models across multiple languages, domains, and accents with significant error rate reductions.

Contribution

This work presents Seed-ASR, an LLM-based speech recognition model built on the audio-conditioned LLM (AcLLM) framework and trained stage-wise, which handles diverse speech and contextual information without relying on extra language models.

Findings

01

Achieves a 10%-40% reduction in word error rates (character error rates for Chinese) on Chinese and English public test sets, compared with recently released large ASR models.

02

Demonstrates superior performance across multiple domains, accents, and languages.

03

Supports scenario-specific deployment without additional language models.
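
For readers less familiar with the metric behind the first finding, the following is a minimal sketch of how word error rate (WER) is conventionally computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function names `word_error_rate` and `relative_reduction` are illustrative, not taken from the paper, and reading the "10%-40% reduction" as a relative reduction is an assumption made here. For Chinese, the same calculation is typically applied to characters instead of whitespace-separated words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative error reduction (assumed reading of '10%-40% reduction')."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: one substitution and one deletion over six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 4))
print(relative_reduction(5.0, 4.0))
```

Under this reading, a model that lowers a baseline WER of 5.0% to 4.0% would show a 20% relative reduction.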

Abstract

Modern automatic speech recognition (ASR) models are required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc.) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data-matching scenarios, and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio-conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation…
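
The abstract's description of feeding continuous speech representations together with contextual information into the LLM can be illustrated with a toy sketch. Everything below is hypothetical and not the authors' implementation: the names `make_matrix`, `project`, and `build_llm_input`, the random weights, the tiny dimensions, and the choice to place context before audio are all assumptions made for illustration. The idea shown is only that context tokens are looked up in an embedding table, audio frames are projected into the same embedding space, and the two are concatenated into one input sequence.

```python
import random


def make_matrix(rows, cols, rng):
    """Random rows x cols matrix; a stand-in for learned weights or features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(frame, weights):
    """Map one audio frame (length in_dim) to the LLM embedding size.

    weights has shape in_dim x out_dim.
    """
    out_dim = len(weights[0])
    return [sum(x * row[j] for x, row in zip(frame, weights)) for j in range(out_dim)]


def build_llm_input(context_ids, audio_frames, embed_table, proj_weights):
    """Concatenate context token embeddings with projected audio frames.

    The context-first ordering is an assumption of this sketch.
    """
    context_emb = [embed_table[token_id] for token_id in context_ids]
    audio_emb = [project(frame, proj_weights) for frame in audio_frames]
    return context_emb + audio_emb


rng = random.Random(0)
embed_table = make_matrix(10, 4, rng)   # toy vocabulary of 10 tokens, embedding size 4
proj_weights = make_matrix(3, 4, rng)   # toy projection from 3-dim audio features
audio_frames = make_matrix(5, 3, rng)   # 5 frames of continuous speech features
sequence = build_llm_input([1, 2], audio_frames, embed_table, proj_weights)
print(len(sequence), len(sequence[0]))
```

In a real system the resulting sequence would be consumed by the LLM, which then generates the transcript; that step is omitted here.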

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques