Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels
Alexandra Zelenin, Alexandra Zhuravlyova

TL;DR
This paper introduces a faster, more memory-efficient implementation of high-rank DoRA that factors the row-wise norm computation and fuses kernels, enabling scalable adaptation in large vision-language models.
Contribution
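The paper proposes a factored norm decomposition that avoids materializing the dense product BA, and fused Triton kernels that collapse the four-kernel DoRA composition into a single pass, reducing the memory and computation costs of high-rank DoRA.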
Findings
The fused implementation is 1.5-2.0x faster than existing DoRA implementations for inference.
Reduces peak VRAM usage by up to 7 GB.
Maintains high similarity and training stability across multiple models and GPUs.
Abstract
Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved. We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that…
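The factored norm described in the abstract can be illustrated with a small NumPy sketch. It relies on the identity ||W_i + s(BA)_i||^2 = ||W_i||^2 + 2s <W_i, (BA)_i> + s^2 ||(BA)_i||^2, where the cross term is computed from W A^T and the last term from the r x r Gram matrix A A^T, so the dense [d_out, d_in] product BA is never formed. The function names, shapes, and float64 precision below are assumptions chosen for illustration. This is not the paper's Triton implementation, and it does not reproduce the numerically stable form that the truncated abstract mentions.

```python
import numpy as np

def dense_row_norms(W, A, B, s):
    """Reference path: materializes the dense [d_out, d_in] product B @ A."""
    return np.linalg.norm(W + s * (B @ A), axis=1)

def factored_row_norms(W, A, B, s):
    """Row-wise norm of W + s * B @ A without forming B @ A.

    Assumed shapes: W [d_out, d_in], A [r, d_in], B [d_out, r].
    Intermediates are [d_out], [d_out, r], and [r, r] only.
    """
    base = np.einsum("ij,ij->i", W, W)      # ||W_i||^2, shape [d_out]
    WA = W @ A.T                            # [d_out, r]
    cross = np.sum(WA * B, axis=1)          # <W_i, (BA)_i>, shape [d_out]
    G = A @ A.T                             # [r, r] Gram matrix of A
    gram = np.sum((B @ G) * B, axis=1)      # ||(BA)_i||^2, shape [d_out]
    sq = base + 2.0 * s * cross + (s ** 2) * gram
    # Clamp tiny negative values caused by floating-point round-off.
    return np.sqrt(np.maximum(sq, 0.0))

# Small illustrative sizes (hypothetical; the abstract discusses d_in = 8192, r = 384).
rng = np.random.default_rng(0)
d_out, d_in, r, s = 64, 128, 8, 0.5
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in))
B = rng.standard_normal((d_out, r))

dense = dense_row_norms(W, A, B, s)
factored = factored_row_norms(W, A, B, s)
print(np.allclose(dense, factored))
```

In float64 the two paths agree to within round-off. In low precision such as bf16, summing three separately computed terms can lose accuracy through cancellation, which is presumably why a numerically stable formulation matters in practice.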
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques
