Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs

Ali Taghibakhshi; Sharath Turuvekere Sreenivas; Saurav Muralidharan; Ruisi Cai; Marcin Chochowski; Ameya Sunil Mahabaleshwarkar; Yoshi Suhara; Oluwatobi Olabiyi; Daniel Korzekwa; Mostofa Patwary; Mohammad Shoeybi; Jan Kautz; Bryan Catanzaro; Ashwath Aithal; Nima Tajbakhsh; Pavlo Molchanov

arXiv:2511.16664 · cs.CL · November 21, 2025

TL;DR

Nemotron Elastic is a framework for building multi-scale, reasoning-oriented large language models whose nested submodels can be extracted zero-shot, cutting training cost by over 360x relative to training from scratch while keeping accuracy on par with or better than the state of the art.

Contribution

The paper presents Nemotron Elastic, a method for embedding multiple nested submodels within a single LLM, enabling multi-budget deployment without additional training.
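
As a rough illustration of what "nested submodels within a single LLM" can mean, the sketch below treats each smaller model as a prefix slice of the parent's weights. Prefix slicing is an assumption made here for clarity, and this page does not say how the paper selects submodel weights. The function names and the layer widths (12, 9, 6) are hypothetical and only loosely echo the 12B/9B/6B family.

```python
# Illustrative sketch only: prefix slicing is an assumed nesting scheme,
# not a description of the paper's actual extraction procedure.
import random


def make_parent_layer(d_in, d_hidden, seed=0):
    """Build a toy parent weight matrix: d_hidden rows, each of length d_in."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(d_in)] for _ in range(d_hidden)]


def extract_submodel(parent_weights, width):
    """Return the first `width` rows of the parent layer.

    The row objects are shared with the parent (not copied), so extraction
    involves no training or fine-tuning step.
    """
    if not 0 < width <= len(parent_weights):
        raise ValueError("width must be between 1 and the parent width")
    return parent_weights[:width]


def forward(weights, x):
    """Apply the linear layer followed by a ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


parent = make_parent_layer(d_in=4, d_hidden=12)

# Hypothetical deployment budgets mapped to layer widths.
budgets = {"large": 12, "medium": 9, "small": 6}
family = {name: extract_submodel(parent, width) for name, width in budgets.items()}

x = [0.5, -1.0, 2.0, 0.25]
out_large = forward(family["large"], x)
out_small = forward(family["small"], x)
```

Because the small layer reuses the parent's rows, its outputs equal the first six outputs of the large layer, which is the sense in which one set of trained weights can serve several budgets.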

Findings

01. Achieved over 360x cost reduction compared to training from scratch.

02. Produced multiple models (9B and 6B) from a 12B model using only 110B tokens.

03. Nested models perform on par with or better than state-of-the-art models in accuracy.
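
To put the first two findings side by side, the back-of-envelope calculation below multiplies the 110B-token budget by the 360x factor. It assumes cost is counted in training tokens, which this page does not state, so the result is only an implied order of magnitude and not a reported figure. The function and constant names are hypothetical.

```python
# Back-of-envelope only: assumes "cost" is measured in training tokens.
def implied_baseline_tokens(elastic_tokens, reduction_factor):
    """Token budget a baseline would need if it cost `reduction_factor` times more."""
    return elastic_tokens * reduction_factor


ELASTIC_TOKENS = 110 * 10**9  # 110B tokens, from the findings above
REDUCTION_FACTOR = 360        # "over 360x", so the result is a lower bound

baseline = implied_baseline_tokens(ELASTIC_TOKENS, REDUCTION_FACTOR)
print(f"{baseline / 1e12:.1f}T tokens")  # prints "39.6T tokens"
```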

Abstract

Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring separate training runs for each different size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, this process still incurs hundreds of billions of tokens worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly…

Taxonomy

Topics: Machine Learning in Materials Science · Advanced Neural Network Applications · Software-Defined Networks and 5G