Advancing Model Refinement: Muon-Optimized Distillation and Quantization for LLM Deployment
Jacob Sander, Brian Jalaian, Venkat R. Dasari

TL;DR
This paper introduces an integrated framework combining quantization, low-rank adaptation, and data distillation to optimize large language models for deployment on resource-constrained devices, achieving up to 2x memory compression and better benchmark results than GPTQ quantization alone.
Contribution
The paper presents a novel pipeline that combines GPTQ-based quantization, LoRA, and a data distillation process with the Muon optimizer to enhance model compression and task-specific accuracy.
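To make the LoRA component concrete, the sketch below counts trainable parameters for a single linear layer under the standard LoRA formulation, in which a frozen weight matrix of shape d_out x d_in is augmented by a rank-r product B·A. The layer dimensions and rank are illustrative assumptions, not values reported by the paper, and the function names are hypothetical.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Number of weights in a dense d_out x d_in linear layer (bias ignored)."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights added by LoRA: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def lora_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Share of the layer's weights that LoRA actually trains."""
    return lora_trainable_params(d_in, d_out, rank) / full_params(d_in, d_out)


if __name__ == "__main__":
    # Assumed sizes for illustration only (not taken from the paper).
    d_in, d_out, rank = 4096, 4096, 8
    print(lora_trainable_params(d_in, d_out, rank), "trainable of",
          full_params(d_in, d_out), "weights")
    print(f"fraction trained: {lora_fraction(d_in, d_out, rank):.4%}")
```

Because only the low-rank factors are updated, the fine-tuning step touches a small fraction of each adapted layer, which is what makes it practical to pair with a quantized base model.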
Findings
Achieves up to 2x memory compression of LLMs.
Outperforms GPTQ quantization alone on benchmark tasks.
Muon optimizer improves fine-tuned model robustness during quantization.
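As a rough check on the 2x figure, weight storage scales with bits per weight, so halving the bit width halves the footprint. The sketch below is back-of-the-envelope arithmetic only; the parameter count and bit widths are assumptions chosen to reproduce the 6GB to 3GB example in the abstract, not settings reported by the authors, and it ignores quantization metadata such as scales and zero points.

```python
def weight_memory_gb(n_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10**9 bytes), ignoring overhead."""
    return n_params * bits_per_weight / 8 / 1e9


def compression_ratio(bits_before: int, bits_after: int) -> float:
    """Ratio of storage before and after changing the bit width."""
    return bits_before / bits_after


if __name__ == "__main__":
    # Hypothetical model: 3 billion parameters, 16-bit weights quantized to 8-bit.
    n_params = 3_000_000_000
    print(weight_memory_gb(n_params, 16), "GB before")
    print(weight_memory_gb(n_params, 8), "GB after")
    print(compression_ratio(16, 8), "x compression")
```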
Abstract
Large Language Models (LLMs) enable advanced natural language processing but face deployment challenges on resource-constrained edge devices due to high computational, memory, and energy demands. Optimizing these models requires addressing three key challenges: acquiring task-specific data, fine-tuning for performance, and compressing models to accelerate inference while reducing resource demands. We propose an integrated framework combining GPTQ-based quantization, low-rank adaptation (LoRA), and a specialized data distillation process to significantly reduce model size and complexity while preserving or enhancing task-specific performance. By leveraging data distillation, knowledge distillation via Kullback-Leibler divergence, Bayesian hyperparameter optimization, and the Muon optimizer, our pipeline achieves up to 2x memory compression (e.g., reducing a 6GB model to 3GB) and enables…
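The knowledge-distillation term named in the abstract can be illustrated with a minimal sketch of a Kullback-Leibler loss between teacher and student output distributions. The temperature softening and the T² scaling follow the common distillation convention and are assumptions here; the summary does not state the exact loss form, temperature, or weighting the authors use.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T**2 (assumed)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p, q)


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]   # made-up logits for one token position
    student = [1.5, 1.2, 0.3]
    print(distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so minimizing it pulls the compressed model's predictions toward those of the original.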
Taxonomy
Topics: Big Data and Digital Economy · Advanced Neural Network Applications · IoT Networks and Protocols
