Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels

Alexandra Zelenin; Alexandra Zhuravlyova

arXiv:2603.22276·cs.LG·March 24, 2026

TL;DR

This paper introduces a memory-efficient and faster implementation of high-rank DoRA by decomposing norms and fusing kernels, enabling scalable adaptation in large vision-language models.

Contribution

It proposes a factored norm decomposition and fused Triton kernels to significantly reduce memory and computation costs of high-rank DoRA.

Findings

1. The fused implementation is 1.5-2.0x faster than existing DoRA implementations for inference.

2. Peak VRAM usage is reduced by up to 7 GB.

3. High similarity and training stability are maintained across multiple models and GPUs.

Abstract

Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved. We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that…
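
The factored norm described in the abstract can be illustrated with a short sketch. Expanding the squared norm of row i gives ||W_i + s B_i A||^2 = ||W_i||^2 + 2s (W_i A^T) · B_i + s^2 B_i (A A^T) B_i^T, which corresponds to the base, cross, and Gram terms. The only intermediates are W A^T (d_out x r) and A A^T (r x r), so the dense d_out x d_in product BA is never formed. The Python below is a dependency-free illustration of that algebra, not the paper's Triton implementation; the function names `factored_row_norms` and `dense_row_norms`, the toy sizes, and the clamp before the square root are assumptions made for this example.

```python
import math
import random


def factored_row_norms(W, A, B, s):
    """Row-wise L2 norms of W + s*(B @ A) without materializing B @ A.

    W is d_out x d_in, B is d_out x r, A is r x d_in (lists of lists).
    Hypothetical helper for illustration only.
    """
    r = len(A)
    # Gram matrix G = A A^T, shape r x r, shared by every output row.
    G = [[sum(x * y for x, y in zip(A[p], A[q])) for q in range(r)]
         for p in range(r)]
    norms = []
    for w_row, b_row in zip(W, B):
        # Base term: ||W_i||^2
        base = sum(w * w for w in w_row)
        # u = W_i A^T has length r (one row of the d_out x r intermediate).
        u = [sum(w * a for w, a in zip(w_row, A[p])) for p in range(r)]
        # Cross term: W_i . (B_i A) = (W_i A^T) . B_i
        cross = sum(u_p * b_p for u_p, b_p in zip(u, b_row))
        # Gram term: B_i G B_i^T = ||B_i A||^2
        gram = sum(b_row[p] * G[p][q] * b_row[q]
                   for p in range(r) for q in range(r))
        sq = base + 2.0 * s * cross + s * s * gram
        # Clamp tiny negative values caused by floating-point round-off.
        norms.append(math.sqrt(max(sq, 0.0)))
    return norms


def dense_row_norms(W, A, B, s):
    """Reference path: build each row of B @ A, then take the norm."""
    r, d_in = len(A), len(W[0])
    norms = []
    for w_row, b_row in zip(W, B):
        ba_row = [sum(b_row[p] * A[p][j] for p in range(r))
                  for j in range(d_in)]
        norms.append(math.sqrt(sum((w + s * x) ** 2
                                   for w, x in zip(w_row, ba_row))))
    return norms


# Toy sizes chosen for this example only.
rng = random.Random(0)
d_out, d_in, r, s = 4, 6, 2, 0.5
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]
B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d_out)]

factored = factored_row_norms(W, A, B, s)
dense = dense_row_norms(W, A, B, s)
max_abs_diff = max(abs(f - d) for f, d in zip(factored, dense))
```

In a tensor framework the same three terms would be batched matrix operations over all rows at once; the loop form here is only meant to make each term visible. `max_abs_diff` compares the factored result against the dense reference on the toy inputs.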

Code & Models

Datasets

eyes-ml/MMFineReason-SFT-123K-Qwen3-VL-235B-Thinking-QR-max4096
dataset · 37 dl

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques