Constrained Edge AI Deployment: Fine-Tuning vs Distillation for LLM Compression
Jacob Sander, David Moe, Achraf Cohen, Brent Venable, Venkat Dasari, Brian Jalaian

TL;DR
This paper compares fine-tuning and self-distillation as re-training methods for large language models compressed by layer-wise pruning, showing that distillation can match or exceed fine-tuning accuracy without labeled data.
Contribution
It isolates the impact of the re-training loss function under a fixed pruning baseline, showing that self-distillation is an effective alternative to traditional fine-tuning for edge deployment.
Findings
KL-based distillation matches or exceeds CE fine-tuning accuracy
Layer-wise MLP pruning is effective for model compression
Self-distillation requires no labeled data for effective recovery
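To make the difference between the two re-training objectives concrete, here is a minimal sketch in plain Python. It is not the authors' code: the function names, the logit values, and the KL direction (teacher relative to student, the usual choice in distillation) are assumptions made for illustration. The point it shows is that cross-entropy needs a ground-truth label index, while the KL objective needs only the teacher's logits.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy(student_logits, label):
    """CE fine-tuning loss: requires a ground-truth label index."""
    return -log_softmax(student_logits)[label]

def kl_distillation(teacher_logits, student_logits):
    """KL(teacher || student): requires only teacher logits, no label.

    Assumption: the direction is teacher relative to student.
    """
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Hypothetical logits over five answer choices (illustrative values only).
teacher = [2.0, 0.5, -1.0, 0.1, -0.3]
student = [1.2, 0.9, -0.5, 0.0, -0.2]

ce_loss = cross_entropy(student, label=0)    # needs the label
kl_loss = kl_distillation(teacher, student)  # needs only the teacher
```

In a self-distillation setup the teacher would be the unpruned model and the student its pruned copy, so the training signal can be generated on-device from unlabeled inputs.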
Abstract
Modern foundational models are often compressed via a combination of structured pruning and re-training to meet the strict compute, memory, and connectivity constraints of edge deployments. While state-of-the-art pruning schemes target the entire Transformer, we adopt a simple, layer-wise L2-norm pruning on only the MLP blocks as a fixed baseline. Our focus is not on achieving maximal compression, but on isolating the impact of the re-training loss function: (i) Fine-tuning with Cross-Entropy (L2PFT), which requires labeled data, versus (ii) Self-Distillation with KL-divergence (L2PSD), which leverages only teacher logits (no labels). We evaluate both pipelines on the OLMo2-7B-SFT model for CommonsenseQA, suitable for intermittent or denied connectivity scenarios typical of edge networks. Under identical pruning schedules, KL-based distillation matches or exceeds CE fine-tuning in test…
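The abstract does not spell out how the L2-norm criterion is applied, so the following is only a sketch under stated assumptions: a simplified two-matrix MLP (real OLMo-style blocks also have a gate projection), hidden units scored by the L2 norm of their up-projection rows, and a hypothetical helper named `prune_mlp`. It illustrates the general idea of structured pruning, where removing a hidden unit deletes a row from one matrix and the matching column from the next.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def prune_mlp(w_up, w_down, keep):
    """Keep the `keep` hidden units with the largest L2-norm up-projection rows.

    w_up:   hidden x d_model (one row per hidden unit)
    w_down: d_model x hidden (one column per hidden unit)
    Assumption: scoring by up-projection row norm is illustrative only.
    """
    scores = [l2_norm(row) for row in w_up]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original unit ordering
    pruned_up = [w_up[i] for i in kept]
    pruned_down = [[row[i] for i in kept] for row in w_down]
    return pruned_up, pruned_down, kept

# Toy weights: 4 hidden units, model width 2 (hypothetical values).
w_up = [[0.9, -0.4], [0.01, 0.02], [0.5, 0.5], [-0.03, 0.0]]
w_down = [[0.2, 0.7, -0.1, 0.3], [0.4, -0.6, 0.8, 0.1]]

pruned_up, pruned_down, kept = prune_mlp(w_up, w_down, keep=2)
```

Because both matrices shrink along the hidden dimension, the pruned block stays a valid, smaller dense MLP that needs no sparse kernels at inference time.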
Taxonomy
Topics: Fault Detection and Control Systems · Distributed and Parallel Computing Systems · Neural Networks and Applications
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · ADaptive gradient method with the OPTimal convergence rate · Pruning · Focus · Softmax
