Constrained Edge AI Deployment: Fine-Tuning vs Distillation for LLM Compression

Jacob Sander; David Moe; Achraf Cohen; Brent Venable; Venkat Dasari; Brian Jalaian

arXiv:2505.18166·cs.LG·May 27, 2025

TL;DR

This paper compares fine-tuning and distillation methods for compressing large language models with layer-wise pruning, showing that distillation can match or outperform fine-tuning in accuracy without labeled data.

Contribution

It isolates the impact of loss functions in model re-training during pruning, demonstrating the effectiveness of self-distillation over traditional fine-tuning for edge deployment.

Findings

01  KL-based distillation matches or exceeds CE fine-tuning accuracy

02  Layer-wise MLP pruning is effective for model compression

03  Self-distillation requires no labeled data for effective recovery
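
The difference between the two re-training losses can be illustrated with a minimal sketch. It assumes a forward KL divergence, KL(teacher || student), at temperature 1 and uses hypothetical logits over five answer options; the paper's exact loss configuration is not given on this page, so this is an illustration rather than the authors' implementation.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy(student_logits, label):
    # CE fine-tuning loss: requires a ground-truth label index.
    return -log_softmax(student_logits)[label]

def kl_distillation(teacher_logits, student_logits):
    # KL(teacher || student): requires only teacher logits, no label.
    # The KL direction and temperature of 1 are assumptions of this sketch.
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Hypothetical logits: unpruned teacher vs. pruned student.
teacher = [2.0, 0.5, -1.0, 0.0, 0.3]
student = [1.2, 0.9, -0.5, 0.1, 0.0]

ce = cross_entropy(student, label=0)          # needs the label
kl = kl_distillation(teacher, student)        # needs no label
kl_self = kl_distillation(teacher, teacher)   # identical models give zero loss
```

Only `cross_entropy` takes a `label` argument, which is why the distillation route can recover accuracy from unlabeled inputs.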

Abstract

Modern foundational models are often compressed via a combination of structured pruning and re-training to meet the strict compute, memory, and connectivity constraints of edge deployments. While state-of-the-art pruning schemes target the entire Transformer, we adopt a simple, layer-wise L2-norm pruning on only the MLP blocks as a fixed baseline. Our focus is not on achieving maximal compression, but on isolating the impact of the re-training loss function: (i) Fine-tuning with Cross-Entropy (L2PFT), which requires labeled data, versus (ii) Self-Distillation with KL-divergence (L2PSD), which leverages only teacher logits (no labels). We evaluate both pipelines on the OLMo2-7B-SFT model for CommonsenseQA, a setup suitable for the intermittent or denied connectivity scenarios typical of edge networks. Under identical pruning schedules, KL-based distillation matches or exceeds CE fine-tuning in test…
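
As a rough illustration of L2-norm pruning on an MLP block, the sketch below scores each hidden unit by the L2 norm of its input-projection row and keeps the highest-scoring units. The two-matrix MLP, the scoring rule, and the names `prune_mlp_by_l2` and `keep_ratio` are assumptions made for this example; the abstract does not specify the paper's exact criterion or schedule, and a real model's MLP may have more projections.

```python
import math

def prune_mlp_by_l2(w_in, w_out, keep_ratio):
    """Structured pruning of one MLP block (hypothetical helper).

    w_in:  hidden x d_model matrix (one row per hidden unit).
    w_out: d_model x hidden matrix (one column per hidden unit).
    Keeps the hidden units whose w_in rows have the largest L2 norm.
    """
    hidden = len(w_in)
    n_keep = max(1, int(round(hidden * keep_ratio)))
    norms = [math.sqrt(sum(v * v for v in row)) for row in w_in]
    ranked = sorted(range(hidden), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:n_keep])  # preserve the original unit order
    new_in = [w_in[i] for i in keep]                     # drop pruned rows
    new_out = [[row[i] for i in keep] for row in w_out]  # drop matching columns
    return new_in, new_out, keep

# Toy block: 4 hidden units, d_model = 2 (values are made up).
w_in = [[0.9, -0.4], [0.01, 0.02], [0.5, 0.5], [0.0, -0.03]]
w_out = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

pruned_in, pruned_out, kept = prune_mlp_by_l2(w_in, w_out, keep_ratio=0.5)
```

Removing a hidden unit means deleting both its row in `w_in` and its column in `w_out`, so the block's input and output dimensions stay unchanged and the rest of the Transformer is untouched.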

Taxonomy

Topics: Fault Detection and Control Systems · Distributed and Parallel Computing Systems · Neural Networks and Applications

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · ADaptive gradient method with the OPTimal convergence rate · Pruning · Focus · Softmax