Qwen3-ASR Technical Report

Xian Shi; Xiong Wang; Zhifang Guo; Yongqi Wang; Pei Zhang; Xinyu Zhang; Zishan Guo; Hongkun Hao; Yu Xi; Baosong Yang; Jin Xu; Jingren Zhou; Junyang Lin

arXiv:2601.21337 · cs.CL · February 2, 2026

TL;DR

This paper introduces the Qwen3-ASR family: two speech recognition models that support 52 languages and dialects and reach state-of-the-art accuracy among open-source ASR models, plus a novel non-autoregressive forced alignment model. All are released under an open-source license.

Contribution

The paper presents new large-scale multilingual ASR models with state-of-the-art accuracy among open-source models, together with a novel non-autoregressive (NAR) forced aligner, improving real-world applicability and efficiency.

Findings

01. Qwen3-ASR-1.7B achieves SOTA performance among open-source ASR models.

02. Qwen3-ASR-0.6B offers the best accuracy-efficiency trade-off.

03. Qwen3-ForcedAligner outperforms existing models in speed and versatility.

Abstract

In this report, we introduce the Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and speech recognition for 52 languages and dialects. Both leverage large-scale speech training data and the strong audio understanding ability of their foundation model, Qwen3-Omni. In addition to the open-source benchmarks, we conduct a comprehensive internal evaluation, because ASR models may differ little in open-source benchmark scores yet exhibit significant quality differences in real-world scenarios. The experiments show that the 1.7B version achieves SOTA performance among open-source ASR models and is competitive with the strongest proprietary APIs, while the 0.6B version offers the best accuracy-efficiency trade-off. Qwen3-ASR-0.6B can…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing