Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs
Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Ruisi Cai, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh

TL;DR
Nemotron Elastic introduces a novel framework for creating multi-scale, reasoning-oriented large language models with nested submodels that can be extracted zero-shot, significantly reducing training costs and maintaining high accuracy.
Contribution
The paper presents Nemotron Elastic, a method for embedding multiple nested submodels within a single LLM, enabling multi-budget deployment without additional training.
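As a rough illustration of what "nested submodels" sharing a parent's weights can mean, the toy sketch below cuts a smaller layer out of a parent weight matrix by keeping its leading rows and columns, with no further training step. It is an assumption-laden simplification, not the paper's method: the source says an end-to-end trained router is involved, whereas this sketch uses a fixed prefix slice. All function names and sizes here are hypothetical.

```python
# Hypothetical illustration only: nested submodels as prefix slices of a parent layer.
# The paper uses a trained router; this sketch replaces it with a fixed prefix slice.

def make_parent(d_out, d_in):
    """Build a toy parent weight matrix with deterministic entries."""
    return [[(i * d_in + j) / 100.0 for j in range(d_in)] for i in range(d_out)]

def extract_submodel(parent, d_out, d_in):
    """Keep the leading d_out rows and d_in columns; no retraining is involved."""
    return [row[:d_in] for row in parent[:d_out]]

def forward(weights, x):
    """Plain matrix-vector product for a single linear layer."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

parent = make_parent(4, 4)               # toy "parent" layer
small = extract_submodel(parent, 2, 3)   # toy "nested" layer cut from the parent
y = forward(small, [1.0, 1.0, 1.0])
```

Every entry of `small` is taken directly from `parent`, so in this toy setting a single set of trained weights serves both sizes.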
Findings
Achieved over 360x cost reduction compared to training from scratch.
Produced multiple models (9B and 6B) from a 12B model using only 110B tokens.
Nested models perform on par with or better than state-of-the-art models in accuracy.
Abstract
Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring separate training runs for each different size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, this process still incurs hundreds of billions of tokens worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly…
Code & Models
- 🤗 nvidia/NVIDIA-Nemotron-3-Nano-4B-GGUF · model · 18k dl · ♡ 108
- 🤗 nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 · model · 47k dl · ♡ 67
- 🤗 nvidia/NVIDIA-Nemotron-3-Nano-4B-FP8 · model · 10k dl · ♡ 19
- 🤗 unsloth/NVIDIA-Nemotron-3-Nano-4B-GGUF · model · 26k dl · ♡ 51
- 🤗 unsloth/NVIDIA-Nemotron-3-Nano-4B · model · 27k dl · ♡ 9
- 🤗 nvidia/Nemotron-Elastic-12B · model · 50 dl · ♡ 58
- 🤗 kraizytommie/Models · model · 3 dl
- 🤗 unsloth/NVIDIA-Nemotron-3-Nano-4B-FP8 · model · 1.2k dl · ♡ 2
- 🤗 huggingworld/NVIDIA-Nemotron-3-Nano-4B-BF16-ONNX · model · 325 dl
Taxonomy
Topics: Machine Learning in Materials Science · Advanced Neural Network Applications · Software-Defined Networks and 5G
