Puzzle: Distillation-Based NAS for Inference-Optimized LLMs
Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Netanel Haber, Ehud Karpas, Roi Koren, Itay Levy, Pavlo Molchanov, Shahar Mor, Zach Moshe, Najeeb Nabwani, Omri Puny, Ran Rubin

TL;DR
Puzzle is a hardware-aware neural architecture search (NAS) framework that uses distillation to derive inference-optimized variants of large language models, speeding up inference on a single GPU while preserving most of the parent model's accuracy.
Contribution
The paper introduces Puzzle, a NAS framework that combines blockwise local knowledge distillation (BLD) with mixed-integer programming to optimize large language models for inference efficiency.
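To make the search step concrete, here is a minimal sketch of the kind of constrained selection problem described above: choose one block variant per layer so that total quality loss is minimized under a cost budget. It is not the paper's implementation. It swaps the mixed-integer program for a small dynamic program over integer costs, and the names (`select_blocks`, `full`, `slim`, `skip`) and all numbers are hypothetical, chosen only for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical candidate for one layer: (name, quality_loss, cost).
# quality_loss stands in for a per-block score after local distillation;
# cost stands in for measured latency in integer units (an assumption
# made so that a simple dynamic program can replace an MIP solver).
Candidate = Tuple[str, float, int]


def select_blocks(
    layers: List[List[Candidate]], budget: int
) -> Optional[Tuple[float, List[str]]]:
    """Pick exactly one candidate per layer, minimizing total quality loss
    subject to total cost <= budget. Returns (loss, names) or None."""
    # best[c] holds the lowest-loss selection whose total cost is exactly c.
    best: List[Optional[Tuple[float, List[str]]]] = [None] * (budget + 1)
    best[0] = (0.0, [])
    for options in layers:
        nxt: List[Optional[Tuple[float, List[str]]]] = [None] * (budget + 1)
        for used, state in enumerate(best):
            if state is None:
                continue
            loss, names = state
            for name, q_loss, cost in options:
                total = used + cost
                if total > budget:
                    continue  # this choice would exceed the budget
                cand = (loss + q_loss, names + [name])
                if nxt[total] is None or cand[0] < nxt[total][0]:
                    nxt[total] = cand
        best = nxt
    feasible = [s for s in best if s is not None]
    return min(feasible, key=lambda s: s[0]) if feasible else None


# Illustrative data only: three layers, three variants each.
layers = [
    [("full", 0.0, 4), ("slim", 0.1, 2), ("skip", 0.9, 0)],
    [("full", 0.0, 4), ("slim", 0.5, 2), ("skip", 0.6, 0)],
    [("full", 0.0, 4), ("slim", 0.2, 2), ("skip", 2.0, 0)],
]
result = select_blocks(layers, budget=8)  # the all-"full" parent costs 12
print(result)
```

With these made-up numbers the selection keeps the full block in the middle layer and uses the slim variant in the other two. Because each layer's candidates are scored independently, the search only has to combine per-block scores rather than train every full architecture.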
Findings
Achieves 2.17x inference speedup on Llama models on a single GPU.
Retains 98.4% of original model accuracy after optimization.
Supports large batch inference with high accuracy on limited hardware.
Abstract
Large language models (LLMs) offer remarkable capabilities, yet their high inference costs restrict wider adoption. While increasing parameter counts improves accuracy, it also broadens the gap between state-of-the-art capabilities and practical deployability. We present Puzzle, a hardware-aware framework that accelerates the inference of LLMs while preserving their capabilities. Using neural architecture search (NAS) at large scale, Puzzle optimizes models with tens of billions of parameters. Our approach utilizes blockwise local knowledge distillation (BLD) for parallel architecture exploration and employs mixed-integer programming for precise constraint optimization. We showcase our framework's impact via Llama-3.1-Nemotron-51B-Instruct (Nemotron-51B) and Llama-3.3-Nemotron-49B, two publicly available models derived from Llama-70B-Instruct. Both models achieve a 2.17x inference…
Code & Models
- 🤗 nvidia/gpt-oss-puzzle-88B · model · 15k dl · ♡ 89
- 🤗 SamPurkis/gpt-oss-puzzle-88B-GGUF · model · 2.3k dl · ♡ 7
- 🤗 nvidia/Llama-3_3-Nemotron-Super-49B-v1_5-FP8 · model · 48k dl · ♡ 26
- 🤗 nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 · model · 2.0k dl · ♡ 344
- 🤗 nvidia/Llama-3_1-Nemotron-Ultra-253B-v1-FP8 · model · 2.1k dl · ♡ 11
- 🤗 nvidia/Llama-3_3-Nemotron-Super-49B-v1-FP8 · model · 990 dl · ♡ 12
- 🤗 nvidia/Llama-3_3-Nemotron-Super-49B-v1_5-NVFP4 · model · 7.6k dl · ♡ 16
- 🤗 nvidia/Llama-3_3-Nemotron-Super-49B-v1 · model · 32k dl · ♡ 321
- 🤗 Mungert/Llama-3_3-Nemotron-Super-49B-v1-GGUF · model · 50 dl · ♡ 5
- 🤗 nvidia/Llama-3_1-Nemotron-Ultra-253B-CPT-v1 · model · 69 dl · ♡ 6
Taxonomy
Topics: Machine Learning and Data Classification
Methods: Knowledge Distillation
