Advancing Model Refinement: Muon-Optimized Distillation and Quantization for LLM Deployment

Jacob Sander; Brian Jalaian; Venkat R. Dasari

arXiv:2601.09865 · cs.LG · January 16, 2026

Open Access

TL;DR

This paper introduces an integrated framework that combines GPTQ-based quantization, low-rank adaptation (LoRA), and data distillation to prepare large language models for deployment on resource-constrained devices, achieving up to 2x memory compression while preserving or improving task-specific performance.

Contribution

The paper presents a novel pipeline that combines GPTQ-based quantization, LoRA, and a data distillation process with the Muon optimizer to enhance model compression and task-specific accuracy.
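As an illustration of the LoRA component only, here is a minimal sketch of a low-rank adapted linear layer. It uses the standard LoRA formulation (a frozen weight `W` plus a scaled low-rank product `B A`); the dimensions, the `alpha` value, and the helper names `matvec` and `lora_forward` are assumptions chosen for the example and are not taken from the paper.

```python
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x), with W frozen and A, B trainable."""
    r = len(A)                       # adapter rank = number of rows in A
    base = matvec(W, x)              # frozen pretrained path
    delta = matvec(B, matvec(A, x))  # low-rank trainable path
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


# Hypothetical sizes, chosen only for illustration.
d_in, d_out, r = 8, 6, 2
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: the adapter starts as a no-op
x = [rng.gauss(0, 1) for _ in range(d_in)]

y0 = lora_forward(W, A, B, x, alpha=4.0)

full_params = d_in * d_out        # parameters in the dense weight
lora_params = r * (d_in + d_out)  # parameters in the adapter
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only `r * (d_in + d_out)` adapter parameters need gradients during fine-tuning.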

Findings

01. Achieves up to 2x memory compression of LLMs.

02. Outperforms GPTQ quantization alone on benchmark tasks.

03. Muon optimizer improves fine-tuned model robustness during quantization.
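For readers unfamiliar with Muon, its core idea is to approximately orthogonalize the momentum of each 2-D weight matrix with a Newton-Schulz iteration before applying the update. The sketch below is a simplified stand-in, not the paper's training code: it uses the classical cubic Newton-Schulz iteration rather than the tuned quintic polynomial used in public Muon implementations, and the function names, learning rate, and momentum coefficient are all assumptions.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=30, eps=1e-7):
    """Approximate the nearest semi-orthogonal matrix to g.

    Classical cubic iteration X <- 1.5 X - 0.5 (X X^T) X, applied after
    Frobenius normalization so that every singular value is at most 1.
    """
    norm = sum(v * v for row in g for v in row) ** 0.5 + eps
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)]
             for ra, rb in zip(x, xxtx)]
    return x


def muon_like_step(w, grad, momentum, lr=0.02, beta=0.95):
    """One simplified Muon-style step (hyperparameters are assumptions)."""
    momentum = [[beta * m + g for m, g in zip(rm, rg)]
                for rm, rg in zip(momentum, grad)]
    update = newton_schulz_orthogonalize(momentum)
    w = [[a - lr * u for a, u in zip(rw, ru)] for rw, ru in zip(w, update)]
    return w, momentum


# Usage: orthogonalize a small gradient matrix and take one step.
g = [[3.0, 0.0, 4.0], [0.0, 2.0, 0.0]]
q = newton_schulz_orthogonalize(g)
w0 = [[0.0] * 3 for _ in range(2)]
m0 = [[0.0] * 3 for _ in range(2)]
w1, m1 = muon_like_step(w0, g, m0)
```

After the iteration, the rows of `q` are approximately orthonormal, so every direction in the update carries a similar magnitude regardless of how skewed the raw gradient was.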

Abstract

Large Language Models (LLMs) enable advanced natural language processing but face deployment challenges on resource-constrained edge devices due to high computational, memory, and energy demands. Optimizing these models requires addressing three key challenges: acquiring task-specific data, fine-tuning for performance, and compressing models to accelerate inference while reducing resource demands. We propose an integrated framework combining GPTQ-based quantization, low-rank adaptation (LoRA), and a specialized data distillation process to significantly reduce model size and complexity while preserving or enhancing task-specific performance. By leveraging data distillation, knowledge distillation via Kullback-Leibler divergence, Bayesian hyperparameter optimization, and the Muon optimizer, our pipeline achieves up to 2x memory compression (e.g., reducing a 6GB model to 3GB) and enables…
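The abstract mentions knowledge distillation via Kullback-Leibler divergence. A minimal sketch of such a loss for a single token position follows. The temperature softening and the temperature-squared scaling are the conventional formulation and are assumptions here, since the truncated abstract does not state the exact loss; the names `log_softmax` and `kd_kl_loss` are illustrative.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_z for z in scaled]


def kd_kl_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The temperature value and the T^2 factor are assumptions, not
    settings reported in the source.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


# Usage with made-up logits over a three-token vocabulary.
teacher = [2.0, 1.0, 0.0]
matched = kd_kl_loss(teacher, [2.0, 1.0, 0.0])     # student agrees with teacher
mismatched = kd_kl_loss(teacher, [0.0, 1.0, 2.0])  # student disagrees
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is what drives the student toward the teacher's behavior during distillation.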

Taxonomy

Topics: Big Data and Digital Economy · Advanced Neural Network Applications · IoT Networks and Protocols