Per-Axis Weight Deltas for Frequent Model Updates

Stefan Kuyumdzhiev; Radostin Cholakov

arXiv:2512.19720·cs.LG·December 24, 2025


Open Access

TL;DR

This paper introduces a 1-bit delta scheme with per-axis (row/column) FP16 scaling for compact, accurate updates to fine-tuned large language models, reducing checkpoint storage and cold-start latency during model serving.

Contribution

The authors propose a novel 1-bit delta method with per-axis FP16 scaling that improves model update compression and reduces cold-start latency without sacrificing inference efficiency.

Findings

01. Achieves better reconstruction quality than scalar delta methods.

02. Reduces storage and cold-start latency significantly.

03. Maintains inference efficiency with minimal calibration data.

Abstract

Serving many task-specialized LLM variants is often limited by the large size of fine-tuned checkpoints and the resulting cold-start latency. Since fine-tuned weights differ from their base model by relatively small structured residuals, a natural approach is to represent them as compressed deltas. We propose a simple 1-bit delta scheme that stores only the sign of the weight difference together with lightweight per-axis (row/column) FP16 scaling factors, learned from a small calibration set. This design preserves the compactness of 1-bit deltas while more accurately capturing variation across weight dimensions, leading to improved reconstruction quality over scalar alternatives. From a systems perspective, a streamlined loader that transfers packed deltas in a single operation per module reduces cold-start latency and storage overhead, with artifacts several times smaller than a full…
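The core idea in the abstract, storing the sign of each weight difference plus one scale per row or column, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. In particular, the paper learns its scales from a small calibration set, whereas this sketch substitutes the closed-form least-squares scale (the mean absolute delta along the chosen axis) as a stand-in. The function names `compress_delta` and `reconstruct` and the `axis` argument are hypothetical.

```python
import numpy as np

def compress_delta(w_base, w_ft, axis="row"):
    """Return (packed sign bits, FP16 scales, shape) for a 2-D weight delta.

    Assumption: scales are the mean |delta| along the chosen axis, which
    minimizes squared reconstruction error for fixed signs. The paper
    instead learns the scales from calibration data.
    """
    delta = (w_ft - w_base).astype(np.float32)
    sign_bits = delta >= 0              # True -> +1, False -> -1
    mag = np.abs(delta)
    if axis == "row":
        scale = mag.mean(axis=1, keepdims=True)   # one scale per row
    elif axis == "col":
        scale = mag.mean(axis=0, keepdims=True)   # one scale per column
    else:
        scale = mag.mean(keepdims=True)           # single scalar baseline
    packed = np.packbits(sign_bits, axis=None)    # 8 signs per byte
    return packed, scale.astype(np.float16), delta.shape

def reconstruct(w_base, packed, scale, shape):
    """Rebuild approximate fine-tuned weights from the base and the delta."""
    n = shape[0] * shape[1]
    bits = np.unpackbits(packed, count=n).reshape(shape)
    signs = bits.astype(np.float32) * 2.0 - 1.0   # {0, 1} -> {-1, +1}
    return w_base + scale.astype(np.float32) * signs

# Toy comparison on synthetic weights whose delta magnitude varies by row.
rng = np.random.default_rng(0)
w_base = rng.normal(size=(64, 128)).astype(np.float32)
row_std = np.linspace(0.001, 0.05, 64, dtype=np.float32)[:, None]
w_ft = w_base + rng.normal(size=(64, 128)).astype(np.float32) * row_std

for mode in ("scalar", "row"):
    packed, scale, shape = compress_delta(w_base, w_ft, axis=mode)
    err = np.linalg.norm(w_ft - reconstruct(w_base, packed, scale, shape))
    print(mode, "bytes:", packed.nbytes + scale.nbytes, "error:", float(err))
```

In this toy setting the per-row variant adds only one FP16 value per row on top of the packed sign bits, which is the kind of trade-off the abstract describes: nearly the same artifact size as a scalar scheme, with scales that can follow variation across weight dimensions.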

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Embedded Systems Design Techniques