NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR
Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang, Kai Qiao, Junfeng Yuan, Shengqing Liu, Yi Zhang, Bowen Chen, Ming Lei, Jie Gao, and Jie Wu

TL;DR
NIM4-ASR is a production-oriented, efficient, and robust LLM-based ASR framework that enhances recognition quality, scalability, and customization for real-time applications.
Contribution
The paper introduces NIM4-ASR, a novel framework with redesigned training and optimization strategies for improved efficiency, robustness, and hotword customization in LLM-based ASR.
Findings
Achieves state-of-the-art performance with only 2.3B parameters.
Outperforms larger models on internal benchmarks, especially in entity-rich scenarios.
Supports million-scale hotword customization with sub-millisecond retrieval latency.
Abstract
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, leaving key practical challenges insufficiently addressed, particularly limited downward scalability in resource-constrained deployments and hallucinations under acoustically challenging conditions. To address these issues, we present NIM4-ASR, a production-oriented LLM-based ASR framework optimized for both efficiency and robustness. Grounded in a principled delineation of functional roles between the encoder and the LLM, we redesign the multi-stage training paradigm to align each module with its intended capability boundary. Specifically, we reformulate the pre-training architecture and objective to mitigate…
