Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

Zhifei Xie; Kaiyu Pang; Haobin Zhang; Deheng Ye; Xiaobin Hu; Shuicheng Yan; and Chunyan Miao

arXiv:2605.19833·cs.SD·May 20, 2026

TL;DR

Mega-ASR is a scalable framework that combines real-world acoustic simulation with progressive training to improve the robustness of speech recognition in adverse, compositionally distorted environments.

Contribution

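The paper presents Mega-ASR, a unified ASR-in-the-wild framework, together with the Voices-in-the-Wild-2M dataset (7 classic acoustic phenomena and 54 physically plausible compound scenarios) and two training methods: Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization.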

Findings

01

Mega-ASR outperforms prior state-of-the-art systems on adverse-condition ASR benchmarks.

02

Mega-ASR achieves a relative WER reduction of over 30% on complex acoustic scenarios.

03

The results demonstrate a scalable paradigm for robust in-the-wild speech recognition.
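
The findings above are stated in terms of word error rate (WER) and relative WER reduction. As a reminder of how those two quantities are usually computed, here is a minimal Python sketch. It assumes plain whitespace tokenization and no text normalization; the function names `word_error_rate` and `relative_wer_reduction` are hypothetical and are not taken from the Mega-ASR repository, whose scoring pipeline may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

With purely hypothetical numbers, a baseline WER of 20% and a new WER of 14% give `relative_wer_reduction(20.0, 14.0)` of 0.3, that is, a 30% relative reduction. Note that WER can exceed 1.0 when the hypothesis contains many insertions.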

Abstract

Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under severe, compositional distortions. We propose Mega-ASR, a unified ASR-in-the-wild framework that combines scalable compound-data construction with progressive acoustic-to-semantic optimization. We introduce Voices-in-the-Wild-2M, covering 7 classic acoustic phenomena and 54 physically plausible compound scenarios, and train Mega-ASR with Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization. Extensive experiments demonstrate that Mega-ASR achieves significant advantages over prior state-of-the-art systems on adverse-condition ASR benchmarks (45.69% vs. 54.01%…
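
The abstract refers to compound scenarios built from several acoustic phenomena. The excerpt here does not describe the construction pipeline, so the following Python sketch only illustrates the general idea of chaining distortions: additive noise at a target signal-to-noise ratio followed by hard clipping. The helpers `mix_at_snr`, `hard_clip`, and `apply_chain`, the choice of these two distortions, and all parameter values are assumptions made for illustration, not the authors' method.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the speech-to-noise power ratio is snr_db."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def hard_clip(signal, limit):
    """Clamp every sample to the range [-limit, limit]."""
    return [max(-limit, min(limit, x)) for x in signal]


def apply_chain(signal, distortions):
    """Apply a list of single-argument distortion functions in order."""
    for distort in distortions:
        signal = distort(signal)
    return signal


# Toy inputs: a 5-cycle sine wave as "speech", seeded uniform noise.
speech = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(100)]

# A hypothetical two-step compound scenario: noise at 10 dB SNR, then clipping.
compound = apply_chain(
    speech,
    [
        lambda x: mix_at_snr(x, noise, snr_db=10.0),
        lambda x: hard_clip(x, limit=0.5),
    ],
)
```

Representing each distortion as a function of the signal keeps the chain order explicit, which matters because operations such as clipping are not interchangeable with additive noise.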

Code & Models

Repositories

xzf-thu/Mega-ASR
github

Datasets

zhifeixie/Voices-in-the-Wild-2M
dataset · 12k downloads
