Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models
Yonggan Fu, Xin Dong, Shizhe Diao, Matthijs Van keirsbilck, Hanrong Ye, Wonmin Byeon, Yashaswi Karnati, Lucas Liebenwein, Hannah Zhang, Nikolaus Binder, Maksim Khadkevich, Alexander Keller, Jan Kautz, Yingyan Celine Lin, Pavlo Molchanov

TL;DR
Nemotron-Flash is a new hybrid small language model family optimized for real-device latency, combining architectural principles, operator choices, and training techniques to significantly improve accuracy and efficiency over existing models.
Contribution
The paper introduces a comprehensive methodology for designing latency-optimized hybrid small language models, including architectural insights, operator selection, and training improvements, culminating in the Nemotron-Flash family.
Findings
Improved average accuracy by more than 5.5% over existing models.
Reduced latency by 1.3x to 1.9x.
Increased throughput by 18.7x to 45.6x.
Abstract
Efficient deployment of small language models (SLMs) is essential for numerous real-world applications with stringent latency constraints. While previous work on SLM design has primarily focused on reducing the number of parameters to achieve parameter-optimal SLMs, parameter efficiency does not necessarily translate into proportional real-device speed-ups. This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training when real-device latency is the primary consideration. Specifically, we identify two central architectural factors: depth-width ratios and operator choices. The former is crucial for small-batch-size latency, while the latter affects both latency and large-batch-size throughput. In light of this, we first study latency-optimal depth-width ratios, with the key finding that…
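The abstract's claim that depth-width ratio matters for small-batch latency can be illustrated with a toy model. The sketch below is not the paper's methodology. It assumes a plain Transformer with roughly 12 * depth * width^2 block parameters (4 * width^2 for attention, 8 * width^2 for a 4x-expansion MLP), memory-bound decoding in which every weight is read once per token, and a fixed per-layer overhead. All function names and the default values for overhead, bandwidth, and bytes per parameter are hypothetical placeholders, not measurements.

```python
import math


def approx_block_params(depth: int, width: int) -> int:
    """Rough block parameter count: (4 + 8) * width^2 per layer (assumed layout)."""
    return 12 * depth * width * width


def width_for_budget(param_budget: int, depth: int) -> int:
    """Largest width whose approximate parameter count fits the budget at this depth."""
    return math.isqrt(int(param_budget // (12 * depth)))


def toy_decode_latency_ms(
    depth: int,
    width: int,
    per_layer_overhead_ms: float = 0.02,  # hypothetical fixed cost per layer
    bandwidth_gb_s: float = 1000.0,       # hypothetical memory bandwidth
    bytes_per_param: int = 2,             # assumes 16-bit weights
) -> float:
    """Toy per-token latency: sequential per-layer overhead plus weight-load time."""
    weight_bytes = approx_block_params(depth, width) * bytes_per_param
    weight_load_ms = weight_bytes / (bandwidth_gb_s * 1e9) * 1e3
    return depth * per_layer_overhead_ms + weight_load_ms


if __name__ == "__main__":
    budget = 1_000_000_000
    for depth in (12, 24, 48):
        width = width_for_budget(budget, depth)
        latency = toy_decode_latency_ms(depth, width)
        print(f"depth={depth:3d} width={width:5d} toy latency={latency:.3f} ms")
```

Under these assumptions, configurations with the same parameter budget load about the same number of weight bytes per token, so the deeper, thinner one is slower only because of its longer chain of sequential layers. Real devices behave less simply, which is why the paper measures latency rather than inferring it from parameter counts.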
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Big Data and Digital Economy
