FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System
Kaituo Xu, Yan Jia, Kai Huang, Junjie Chen, Wenpeng Li, Kun Liu, Feng-Long Xie, Xu Tang, Yao Hu

TL;DR
FireRedASR2S is an integrated, industrial-grade speech recognition system that combines ASR, VAD, LID, and punctuation prediction modules to achieve state-of-the-art accuracy across languages, dialects, and tasks. It is released as open source.
Contribution
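The paper introduces FireRedASR2S, an all-in-one ASR system that integrates four modules (ASR, VAD, LID, and punctuation prediction) in a unified pipeline. It supports multiple languages, dialects, and tasks, and outperforms competitive baselines on the evaluated benchmarks.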
Findings
FireRedASR2-LLM achieves 2.89% average CER on 4 public Mandarin benchmarks.
The VAD module reaches a 97.57% frame-level F1 score.
The LID module reaches 97.18% accuracy on 82 languages.
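For readers unfamiliar with the two headline metrics, the sketch below shows how a character error rate (CER) and a frame-level F1 score are commonly computed. It is a minimal illustration under assumptions, not the paper's scoring code: the function names (`edit_distance`, `cer`, `frame_f1`) are hypothetical, no text normalization is applied, and the listing does not say how the reported average over benchmarks is taken.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def frame_f1(ref_frames, hyp_frames):
    """F1 over per-frame labels, where 1 = speech and 0 = non-speech."""
    if len(ref_frames) != len(hyp_frames):
        raise ValueError("label sequences must have equal length")
    pairs = list(zip(ref_frames, hyp_frames))
    tp = sum(1 for r, h in pairs if r == 1 and h == 1)
    fp = sum(1 for r, h in pairs if r == 0 and h == 1)
    fn = sum(1 for r, h in pairs if r == 1 and h == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

As a usage example, `cer("abcd", "abed")` gives 0.25 (one substitution over four reference characters), and `frame_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])` gives 2/3, since precision and recall are both 2/3.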
Abstract
We present FireRedASR2S, a state-of-the-art industrial-grade all-in-one automatic speech recognition (ASR) system. It integrates four modules in a unified pipeline: ASR, Voice Activity Detection (VAD), Spoken Language Identification (LID), and Punctuation Prediction (Punc). All modules achieve SOTA performance on the evaluated benchmarks:
FireRedASR2: An ASR module with two variants, FireRedASR2-LLM (8B+ parameters) and FireRedASR2-AED (1B+ parameters), supporting speech and singing transcription for Mandarin, Chinese dialects and accents, English, and code-switching. Compared to FireRedASR, FireRedASR2 delivers improved recognition accuracy and broader dialect and accent coverage. FireRedASR2-LLM achieves 2.89% average CER on 4 public Mandarin benchmarks and 11.55% on 19 public Chinese dialects and accents benchmarks, outperforming competitive baselines including Doubao-ASR, Qwen3-ASR,…
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Speech and Audio Processing
