NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang, Kai Qiao, Junfeng Yuan, Shengqing Liu, Yi Zhang, Bowen Chen, Ming Lei, Jie Gao, and Jie Wu

arXiv:2604.18105 · eess.AS · April 21, 2026

TL;DR

NIM4-ASR is a production-oriented, efficient, and robust LLM-based ASR framework that enhances recognition quality, scalability, and customization for real-time applications.

Contribution

The paper introduces NIM4-ASR, a novel framework with redesigned training and optimization strategies for improved efficiency, robustness, and hotword customization in LLM-based ASR.

Findings

01. Achieves state-of-the-art performance with only 2.3B parameters.

02. Outperforms larger models on internal benchmarks, especially in entity-rich scenarios.

03. Supports million-scale hotword customization with sub-millisecond retrieval latency.

Abstract

Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed -- particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate…
