NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
NVIDIA: Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Alexander Bukharin, Ali Taghibakhshi, Amelia Barton

TL;DR
The paper presents Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer model that combines high inference throughput with state-of-the-art reasoning accuracy among similarly-sized models, and that is compressed to support inference on up to 128k tokens on a single NVIDIA A10G GPU.
Contribution
Introduces Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer model with improved inference speed and accuracy for reasoning workloads, together with the Minitron-based compression and distillation used to produce it from the aligned 12-billion-parameter base model.
Findings
Achieves up to 6x higher inference throughput compared to similar-sized models.
Achieves state-of-the-art accuracy on reasoning benchmarks compared to similarly-sized models.
Enables inference on up to 128k tokens on a single NVIDIA A10G GPU (22 GiB of memory, bfloat16 precision).
Abstract
We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to…
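The memory target in the abstract can be made concrete with some back-of-the-envelope arithmetic. The sketch below is not from the paper: it assumes nominal parameter counts of 9e9 and 12e9 (taken from the model names, not exact figures) and 2 bytes per parameter for bfloat16, and it ignores activations, framework overhead, and the cache and state memory needed for a 128k-token context. The function names are illustrative only.

```python
# Rough weight-memory estimate for a 22 GiB budget in bfloat16.
# ASSUMPTIONS (not from the paper): nominal parameter counts read off the
# model names; weights only, no activations, cache, or runtime overhead.

GIB = 1024 ** 3          # bytes per GiB
BF16_BYTES = 2           # bfloat16 stores each parameter in 2 bytes


def weight_memory_gib(num_params: float, bytes_per_param: int = BF16_BYTES) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / GIB


def remaining_budget_gib(total_gib: float, num_params: float,
                         bytes_per_param: int = BF16_BYTES) -> float:
    """Memory left for everything else after loading the weights."""
    return total_gib - weight_memory_gib(num_params, bytes_per_param)


if __name__ == "__main__":
    budget_gib = 22.0    # figure quoted in the abstract
    nominal_sizes = {"12B (nominal)": 12e9, "9B (nominal)": 9e9}
    for name, params in nominal_sizes.items():
        weights = weight_memory_gib(params)
        left = remaining_budget_gib(budget_gib, params)
        print(f"{name}: weights ~{weights:.2f} GiB, remaining ~{left:.2f} GiB")
```

Under these assumptions the 12B weights alone come to roughly 22.4 GiB, which already exceeds the 22 GiB budget, while the 9B weights take roughly 16.8 GiB and leave about 5 GiB for the long-context cache and recurrent state. That gap is one plausible reading of why the 12B model is compressed before deployment.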
Code & Models
- 🤗 nvidia/NVIDIA-Nemotron-Nano-9B-v2 · model · 429k dl · ♡ 487
- 🤗 nvidia/NVIDIA-Nemotron-Nano-12B-v2 · model · 30k dl · ♡ 161
- 🤗 cpagac/Nemotron-Nano-9B-v2-heretic · model · 278 dl · ♡ 3
- 🤗 cyankiwi/NVIDIA-Nemotron-Nano-9B-v2-AWQ-4bit · model · 389 dl · ♡ 3
- 🤗 unsloth/NVIDIA-Nemotron-Nano-9B-v2 · model · 617 dl · ♡ 3
- 🤗 nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base · model · 2.5k dl · ♡ 89
- 🤗 nvidia/NVIDIA-Nemotron-Nano-9B-v2-Base · model · 145k dl · ♡ 43
- 🤗 dominguesm/NVIDIA-Nemotron-Nano-9B-v2-GGUF · model · 643 dl · ♡ 1
- 🤗 gabriellarson/NVIDIA-Nemotron-Nano-12B-v2-GGUF · model · 193 dl
- 🤗 QuantFactory/NVIDIA-Nemotron-Nano-9B-v2-GGUF · model · 570 dl · ♡ 4
Taxonomy
Topics: Big Data and Digital Economy · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques
