Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

Yuxian Gu; Qinghao Hu; Shang Yang; Haocheng Xi; Junyu Chen; Song Han; Han Cai

arXiv:2508.15884 · cs.CL · September 30, 2025

TL;DR

Jet-Nemotron introduces a hybrid-architecture language model optimized via Post Neural Architecture Search, achieving high accuracy and significantly improved generation throughput compared to traditional full-attention models.

Contribution

The paper presents PostNAS, a novel neural architecture exploration pipeline that efficiently designs hybrid-architecture language models starting from pre-trained full-attention models.

Findings

01 Achieves up to 53.6x generation throughput speedup

02 Matches or exceeds accuracy of leading full-attention models

03 Outperforms recent MoE models on MMLU benchmarks

Abstract

We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a…
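
Component (1) of the pipeline, full-attention layer placement, can be illustrated with a toy sketch. The abstract does not say how the placement is learned, so the code below substitutes a simple top-k selection over per-layer importance scores. The function name `choose_layer_types`, the score values, and the budget of two full-attention layers are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def choose_layer_types(importance: Dict[int, float], num_full_attention: int) -> List[str]:
    """Keep full attention in the highest-scoring layers; use linear attention elsewhere.

    `importance` maps a layer index to a hypothetical importance score.
    This top-k rule is a stand-in for the learned placement in the paper.
    """
    if not 0 <= num_full_attention <= len(importance):
        raise ValueError("num_full_attention must be between 0 and the number of layers")
    # Rank layer indices from most to least important.
    ranked = sorted(importance, key=lambda i: importance[i], reverse=True)
    keep = set(ranked[:num_full_attention])
    # Emit one label per layer, in layer order.
    return ["full" if i in keep else "linear" for i in sorted(importance)]


# Made-up scores for a 6-layer model (illustration only).
scores = {0: 0.10, 1: 0.85, 2: 0.30, 3: 0.05, 4: 0.70, 5: 0.20}
layout = choose_layer_types(scores, num_full_attention=2)
print(layout)
```

With these made-up scores, layers 1 and 4 stay full attention and the other four become linear-attention layers. In the setting the abstract describes, the MLP weights would remain frozen while the attention blocks are explored.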
