FireRedASR: Open-Source Industrial-Grade Mandarin Speech Recognition Models from Encoder-Decoder to LLM Integration
Kai-Tuo Xu, Feng-Long Xie, Xu Tang, Yao Hu

TL;DR
FireRedASR introduces large-scale Mandarin speech recognition models, including an LLM-integrated variant for high accuracy and an efficient encoder-decoder model, achieving state-of-the-art results and broad applicability.
Contribution
The paper presents FireRedASR, a new family of Mandarin ASR models comprising an LLM-integrated variant and an efficient encoder-decoder variant. The models surpass existing SOTA performance and support diverse speech recognition scenarios.
Findings
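FireRedASR-LLM (8.3B parameters) achieves an average CER of 3.05% on public Mandarin benchmarks, an 8.4% relative CER reduction from the previous SOTA of 3.33%.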
FireRedASR-AED achieves 3.18% CER, outperforming larger models.
Both models perform well on dialects, English speech, and singing lyrics.
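The 8.4% figure is a relative reduction, not an absolute gap between the two error rates. A minimal sketch of the arithmetic follows, using the 3.33% and 3.05% values quoted in the abstract. The function name is illustrative and does not come from the paper.

```python
def relative_cer_reduction(baseline_cer: float, new_cer: float) -> float:
    """Relative CER reduction (CERR): the fraction of the baseline error removed."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - new_cer) / baseline_cer


# Values quoted in the abstract: previous SOTA 3.33%, FireRedASR-LLM 3.05%.
cerr = relative_cer_reduction(3.33, 3.05)
print(f"CERR = {cerr * 100:.1f}%")  # prints "CERR = 8.4%"
```

The absolute difference is 0.28 percentage points. Dividing it by the 3.33% baseline gives the 8.4% relative figure.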
Abstract
We present FireRedASR, a family of large-scale automatic speech recognition (ASR) models for Mandarin, designed to meet diverse requirements in superior performance and optimal efficiency across various applications. FireRedASR comprises two variants:

FireRedASR-LLM: Designed to achieve state-of-the-art (SOTA) performance and to enable seamless end-to-end speech interaction. It adopts an Encoder-Adapter-LLM framework leveraging large language model (LLM) capabilities. On public Mandarin benchmarks, FireRedASR-LLM (8.3B parameters) achieves an average Character Error Rate (CER) of 3.05%, surpassing the latest SOTA of 3.33% with an 8.4% relative CER reduction (CERR). It demonstrates superior generalization capability over industrial-grade baselines, achieving 24%-40% CERR in multi-source Mandarin ASR scenarios such as video, live, and intelligent assistant.

FireRedASR-AED: Designed to…
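Character Error Rate, the metric reported in the abstract, is conventionally the character-level edit distance between reference and hypothesis divided by the reference length. The sketch below follows that standard definition. It is not the paper's scoring pipeline: the summary does not state the text normalization or punctuation handling used. The function names and the example sentence are illustrative assumptions.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]  # turning i reference characters into an empty hypothesis
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substituted character in a six-character reference.
print(cer("今天天气很好", "今天天汽很好"))
```

Mandarin transcripts have no whitespace word boundaries, which is why error rates are usually counted per character rather than per word.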
Taxonomy
Topics: Speech Recognition and Synthesis
