Seed-ASR: Understanding Diverse Speech and Contexts with LLM-based Speech Recognition
Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, Lu Gao, Yi Guo, Minglun Han, Ting Han, Wenchao Hu, Xinying Hu, Yuxiang Hu, Deyu Hua, Lu Huang, Mingkun Huang, Youjia Huang, Jishuo Jin, Fanliu Kong

TL;DR
Seed-ASR introduces an LLM-based speech recognition framework that effectively handles diverse speech signals and contextual information, outperforming traditional models across multiple languages, domains, and accents with significant error rate reductions.
Contribution
This work presents Seed-ASR, an LLM-based speech recognition model that builds on the audio-conditioned LLM (AcLLM) framework and stage-wise training to handle diverse speech and contexts without extra language models.
Findings
Achieves a 10%-40% reduction in word error rate (character error rate for Chinese) on Chinese and English public test sets, relative to recently released large ASR models.
Demonstrates superior performance across multiple domains, accents, and languages.
Supports scenario-specific deployment without additional language models.
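The error-rate figures above rest on a standard metric, which can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function names (`edit_distance`, `error_rate`, `relative_reduction`) and all example sentences and numbers are hypothetical. Word error rate is the token-level edit distance between reference and hypothesis divided by the reference length; for Chinese the same calculation is commonly done over characters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by reference length (WER or CER)."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def relative_reduction(baseline, new):
    """Relative error-rate reduction of `new` with respect to `baseline`."""
    return (baseline - new) / baseline


# Hypothetical examples, not taken from the paper.
wer = error_rate("the cat sat on the mat".split(),
                 "the cat sit on mat".split())   # 2 edits / 6 words
cer = error_rate(list("今天天气很好"), list("今天天气真好"))  # 1 edit / 6 chars
gain = relative_reduction(baseline=0.05, new=0.04)  # illustrative rates only
print(f"WER={wer:.3f} CER={cer:.3f} relative reduction={gain:.0%}")
```

A drop from 5% to 4% is a 20% relative reduction in this sense, even though the absolute difference is only one percentage point.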
Abstract
Modern automatic speech recognition (ASR) models are required to accurately transcribe diverse speech signals (from different domains, languages, accents, etc.) given the specific contextual information in various application scenarios. Classic end-to-end models fused with extra language models perform well, but mainly in data-matching scenarios, and are gradually approaching a bottleneck. In this work, we introduce Seed-ASR, a large language model (LLM) based speech recognition model. Seed-ASR is developed based on the framework of audio-conditioned LLM (AcLLM), leveraging the capabilities of LLMs by inputting continuous speech representations together with contextual information into the LLM. Through stage-wise large-scale training and the elicitation of context-aware capabilities in LLM, Seed-ASR demonstrates significant improvement over end-to-end models on comprehensive evaluation…
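To make the idea of feeding continuous speech representations together with contextual information into an LLM more concrete, here is a toy sketch in plain Python. It is not the Seed-ASR implementation: the projection layer, the dimensions, the random values, and the choice to place context before audio are all assumptions made for illustration, and the names `linear_project` and `build_llm_input` are hypothetical.

```python
import random


def linear_project(frames, weight, bias):
    """Map each audio frame (length d_audio) to the LLM embedding size.

    `weight` holds one list of d_audio coefficients per output dimension.
    """
    return [
        [sum(x * w for x, w in zip(frame, col)) + b
         for col, b in zip(weight, bias)]
        for frame in frames
    ]


def build_llm_input(context_embeddings, audio_embeddings):
    """Concatenate context and audio embeddings along the sequence axis.

    The context-first ordering is an assumption of this sketch.
    """
    sequence = context_embeddings + audio_embeddings
    if sequence and any(len(v) != len(sequence[0]) for v in sequence):
        raise ValueError("all embeddings must share one dimension")
    return sequence


random.seed(0)
D_AUDIO, D_LLM = 4, 6  # hypothetical sizes, far smaller than a real model

# Stand-ins for audio encoder output (10 frames) and embedded context (3 tokens).
frames = [[random.gauss(0, 1) for _ in range(D_AUDIO)] for _ in range(10)]
context = [[random.gauss(0, 1) for _ in range(D_LLM)] for _ in range(3)]

weight = [[random.gauss(0, 0.1) for _ in range(D_AUDIO)] for _ in range(D_LLM)]
bias = [0.0] * D_LLM

audio = linear_project(frames, weight, bias)
llm_input = build_llm_input(context, audio)
print(len(llm_input), len(llm_input[0]))  # 13 6
```

In a real system the resulting sequence would be passed to the LLM, which then generates the transcript as text tokens; that step is omitted here.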
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques
