Qwen3-ASR Technical Report
Xian Shi, Xiong Wang, Zhifang Guo, Yongqi Wang, Pei Zhang, Xinyu Zhang, Zishan Guo, Hongkun Hao, Yu Xi, Baosong Yang, Jin Xu, Jingren Zhou, Junyang Lin

TL;DR
This paper introduces the Qwen3-ASR family of speech recognition models, which support 52 languages and dialects and achieve state-of-the-art accuracy among open-source ASR models, along with a novel non-autoregressive forced alignment model. All of them are released under an open-source license.
Contribution
The paper presents new large-scale multilingual ASR models with state-of-the-art accuracy and a novel non-autoregressive (NAR) forced aligner, improving real-world applicability and efficiency.
Findings
Qwen3-ASR-1.7B achieves SOTA performance among open-source ASR models and is competitive with the strongest proprietary APIs.
Qwen3-ASR-0.6B offers the best accuracy-efficiency trade-off.
Qwen3-ForcedAligner outperforms existing models in speed and versatility.
Abstract
In this report, we introduce the Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and ASR for 52 languages and dialects. Both leverage large-scale speech training data and the strong audio understanding ability of their foundation model, Qwen3-Omni. We conduct a comprehensive internal evaluation in addition to the open-source benchmarks, because ASR models may differ little on open-source benchmark scores yet exhibit significant quality differences in real-world scenarios. The experiments reveal that the 1.7B version achieves SOTA performance among open-source ASR models and is competitive with the strongest proprietary APIs, while the 0.6B version offers the best accuracy-efficiency trade-off. Qwen3-ASR-0.6B can…
Code & Models
- 🤗 Qwen/Qwen3-ASR-1.7B (model) · 1.1M downloads · 641 likes
- 🤗 Qwen/Qwen3-ASR-0.6B (model) · 429k downloads · 255 likes
- 🤗 Daumee/Qwen3-ASR-0.6B-ONNX-CPU (model) · 360 downloads · 5 likes
- 🤗 vrfai/qwen3asr-fp8 (model) · 36 downloads · 1 like
- 🤗 vrfai/qwen3asr-nvfp4 (model) · 37 downloads · 1 like
- 🤗 Qwen/Qwen3-ForcedAligner-0.6B (model) · 173k downloads · 102 likes
- 🤗 FluidInference/qwen3-asr-0.6b-coreml (model) · 702 downloads · 7 likes
- 🤗 Accordic/qwen3-asr-1-7b-model (model) · 13 downloads
- 🤗 Accordic/qwen3-forcedaligner-0-6b-model (model) · 1 download
- 🤗 Tomhn/Qwen3-ASR-1.7B (model) · 5 downloads
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
