Puzzle: Distillation-Based NAS for Inference-Optimized LLMs

Akhiad Bercovich; Tomer Ronen; Talor Abramovich; Nir Ailon; Nave Assaf; Mohammad Dabbah; Ido Galil; Amnon Geifman; Yonatan Geifman; Izhak Golan; Netanel Haber; Ehud Karpas; Roi Koren; Itay Levy; Pavlo Molchanov; Shahar Mor; Zach Moshe; Najeeb Nabwani; Omri Puny; Ran Rubin; Itamar Schen; Ido Shahaf; Oren Tropp; Omer Ullman Argov; Ran Zilberstein; Ran El-Yaniv

arXiv:2411.19146 · cs.LG · June 4, 2025 · 2 cites

TL;DR

Puzzle is a hardware-aware neural architecture search framework that distills large language models to optimize inference speed on single GPUs while maintaining high accuracy, enabling more practical deployment of large models.

Contribution

The paper introduces Puzzle, a novel NAS framework utilizing blockwise local knowledge distillation and mixed-integer programming to optimize large language models for inference efficiency.

Findings

01 Achieves a 2.17x inference speedup on Llama models on a single GPU.

02 Retains 98.4% of the original model's accuracy after optimization.

03 Supports large-batch inference with high accuracy on limited hardware.

Abstract

Large language models (LLMs) offer remarkable capabilities, yet their high inference costs restrict wider adoption. While increasing parameter counts improves accuracy, it also widens the gap between state-of-the-art capabilities and practical deployability. We present Puzzle, a hardware-aware framework that accelerates the inference of LLMs while preserving their capabilities. Using neural architecture search (NAS) at large scale, Puzzle optimizes models with tens of billions of parameters. Our approach utilizes blockwise local knowledge distillation (BLD) for parallel architecture exploration and employs mixed-integer programming for precise constraint optimization. We showcase our framework's impact via Llama-3.1-Nemotron-51B-Instruct (Nemotron-51B) and Llama-3.3-Nemotron-49B, two publicly available models derived from Llama-70B-Instruct. Both models achieve a 2.17x inference…
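
The constrained selection step mentioned in the abstract can be pictured as choosing one candidate block per layer so that total quality is maximized under a resource budget. The sketch below is only an illustration of that idea, not the paper's method: it swaps the mixed-integer programming solver for plain brute-force enumeration, and the candidate names (`full`, `slim`, `skip`), the integer quality scores, the cost units, and the function `select_blocks` are all hypothetical.

```python
from itertools import product

# Hypothetical per-layer candidates: (name, quality_score, cost).
# All numbers are invented for illustration; none come from the paper.
LAYERS = [
    [("full", 100, 4), ("slim", 90, 2), ("skip", 50, 0)],
    [("full", 100, 4), ("slim", 97, 2), ("skip", 92, 0)],
    [("full", 100, 4), ("slim", 80, 2), ("skip", 30, 0)],
]


def select_blocks(layers, budget):
    """Pick exactly one candidate per layer.

    Maximizes the summed quality score subject to summed cost <= budget.
    Brute-force enumeration stands in for a MIP solver here, so this only
    scales to toy inputs. Returns (score, cost, names), or None if no
    combination fits the budget.
    """
    best = None
    for choice in product(*layers):
        cost = sum(c[2] for c in choice)
        if cost > budget:
            continue  # violates the resource constraint
        score = sum(c[1] for c in choice)
        if best is None or score > best[0]:
            best = (score, cost, [c[0] for c in choice])
    return best


best_score, best_cost, best_names = select_blocks(LAYERS, budget=6)
print(best_names, best_score, best_cost)
```

With these made-up numbers and a budget of 6, the search keeps the full block only in the last layer, where cheaper variants lose the most quality, and skips the middle layer, where skipping costs little. A real solver would reach the same kind of trade-off without enumerating every combination.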

Videos

Puzzle: Distillation-Based NAS for Inference-Optimized LLMs · SlidesLive

Taxonomy

Topics: Machine Learning and Data Classification

Methods: Knowledge Distillation