Budgeted LoRA: Distillation as Structured Compute Allocation for Efficient Inference
Mohammed Sabry, Anya Belz

TL;DR
This paper introduces Budgeted LoRA, a distillation framework that allocates compute resources between dense and low-rank pathways to produce efficient inference models under explicit compute constraints.
Contribution
It formulates model compression as a structured compute allocation problem, enabling flexible trade-offs between speed and accuracy during distillation.
Findings
Achieves up to 4.05x speedup with moderate perplexity degradation.
Matches standard LoRA perplexity at a 1.74x compressed-module speedup.
Preserves higher accuracy on in-context learning probes under compute constraints.
Abstract
We study distillation for large language models under explicit compute constraints, with the goal of producing student models that are not only cheaper to train but also structurally efficient at inference time. Prior approaches to parameter-efficient distillation, such as LoRA, reduce adaptation cost but leave the dense backbone unchanged, and therefore fail to deliver meaningful inference savings. We propose Budgeted LoRA, a distillation framework that treats model compression as a structured compute allocation problem. Instead of using a fixed student architecture, we introduce a global compute budget that sets the final target fraction of dense computation retained. Under this constraint, the model redistributes capacity across dense and low-rank pathways via (i) module-level dense retention coefficients, (ii) adaptive low-rank allocation, and (iii) post-training compression that…
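As a rough illustration of the budget constraint described in the abstract, the sketch below tracks the fraction of dense computation retained across modules and shrinks per-module retention coefficients until a global budget is met. Everything here is an assumption made for illustration: the `Module` fields, the multiply-accumulate cost model, and the uniform rescaling rule are hypothetical stand-ins, not the paper's learned allocation procedure.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Module:
    """Hypothetical description of one linear module in the student."""
    name: str
    d_in: int
    d_out: int
    dense_keep: float  # assumed dense retention coefficient in [0, 1]
    rank: int          # assumed rank of the low-rank pathway


def dense_cost(m: Module) -> int:
    # Multiply-accumulates for a full d_in x d_out matrix-vector product.
    return m.d_in * m.d_out


def lowrank_cost(m: Module) -> int:
    # Multiply-accumulates for a rank-r factorization: down then up projection.
    return m.rank * (m.d_in + m.d_out)


def retained_dense_fraction(modules: list[Module]) -> float:
    # Cost-weighted share of dense computation that is kept.
    total = sum(dense_cost(m) for m in modules)
    kept = sum(m.dense_keep * dense_cost(m) for m in modules)
    return kept / total


def scale_to_budget(modules: list[Module], budget: float) -> list[Module]:
    # Illustrative rule only: shrink every coefficient by the same factor
    # so the retained dense fraction equals the budget.
    frac = retained_dense_fraction(modules)
    if frac <= budget:
        return list(modules)
    s = budget / frac
    return [replace(m, dense_keep=m.dense_keep * s) for m in modules]


if __name__ == "__main__":
    mods = [Module("attn", 64, 64, 1.0, 4), Module("mlp", 64, 256, 0.5, 8)]
    for m in scale_to_budget(mods, budget=0.3):
        print(m.name, m.dense_keep, lowrank_cost(m))
```

A uniform rescale is the simplest way to satisfy the constraint. A method that learns module-level coefficients would instead let some modules keep more dense computation than others while meeting the same global budget.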
