D2-LoRA: A Synergistic Approach to Differential and Directional Low-Rank Adaptation
Nozomu Fujisawa, Masaaki Kondo

TL;DR
D2-LoRA is a parameter-efficient fine-tuning method that combines signed low-rank residual updates with a train-time column-wise norm projection. It reaches 76.4% average accuracy across eight QA and reading comprehension benchmarks using 5k training samples per task, and it adds no inference latency once the adapter is merged.
Contribution
The paper introduces D2-LoRA, a low-rank adaptation technique that improves accuracy and training stability over LoRA while preserving algebraic mergeability, so the merged model adds zero inference latency.
Findings
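Achieves 76.4% average accuracy across eight QA and reading comprehension benchmarks, using 5k training samples per task and two epochs.
Improves average accuracy by 2.2 percentage points over LoRA (1.6 points at matched parameter counts, LoRA rank 2r versus D2-LoRA rank r).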
Reduces training volatility by 36% and increases evaluation throughput by 1.91x.
Abstract
We systematically investigate the parameter-efficient fine-tuning design space under practical data and compute constraints, and propose D2-LoRA. D2-LoRA achieves 76.4 percent average accuracy across eight question answering and reading comprehension benchmarks using only 5k training samples per task and two epochs, while preserving algebraic mergeability at inference with near-exact numerical equivalence. The method combines signed low-rank residual updates with additive and subtractive components, together with a train-time column-wise projection that keeps each column close to its original norm. After training, the adapter is merged into a single weight matrix, adding zero inference latency. Compared with LoRA, D2-LoRA improves average accuracy by 2.2 percentage points; at matched parameter counts (LoRA rank 2r versus D2-LoRA rank r), the improvement is 1.6 points, indicating gains…
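The mechanism described in the abstract can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the names `signed_residual` and `project_columns`, the scaling factor `scale`, the toy shapes, and the choice to apply a hard rescaling to the adapted weight are all assumptions. The abstract only states that the update has additive and subtractive low-rank components, that a train-time column-wise projection keeps column norms close to their original values, and that the adapter merges into a single weight matrix. The value 0.5 for the negative-branch weight follows the setting the reviews attribute to the main experiments.

```python
import numpy as np

def signed_residual(A_pos, B_pos, A_neg, B_neg, tau, scale):
    """Signed low-rank update: additive branch minus tau times the subtractive branch."""
    return scale * (B_pos @ A_pos - tau * (B_neg @ A_neg))

def project_columns(W, W_ref, eps=1e-12):
    """Rescale each column of W to the norm of the matching column of W_ref.
    Assumed form of the train-time projection; the paper's exact rule may differ."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    ref_norms = np.linalg.norm(W_ref, axis=0, keepdims=True)
    return W * (ref_norms / (norms + eps))

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 4, 2          # toy sizes, chosen only for illustration
tau, scale = 0.5, 1.0             # tau weights the subtractive branch

W0 = rng.normal(size=(d_out, d_in))            # frozen pretrained weight
A_pos = rng.normal(size=(r, d_in)) * 0.1       # additive branch factors
B_pos = rng.normal(size=(d_out, r)) * 0.1
A_neg = rng.normal(size=(r, d_in)) * 0.1       # subtractive branch factors
B_neg = rng.normal(size=(d_out, r)) * 0.1

# Merge: fold both branches into one dense matrix.
delta = signed_residual(A_pos, B_pos, A_neg, B_neg, tau, scale)
W_merged = W0 + delta

# Unmerged forward pass keeps the branches separate.
x = rng.normal(size=(d_in,))
y_adapter = W0 @ x + scale * (B_pos @ (A_pos @ x) - tau * (B_neg @ (A_neg @ x)))
y_merged = W_merged @ x
merge_error = float(np.max(np.abs(y_adapter - y_merged)))

# Train-time view: adapted weight with columns rescaled to the original norms.
W_train = project_columns(W0 + delta, W0)
```

Here `merge_error` sits at floating-point rounding level, which is what mergeability means for a purely linear update. How the paper reconciles the train-time projection with the merged weight is not specified in this excerpt, so the sketch keeps the two steps separate.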
Peer Reviews
Decision · Submitted to ICLR 2026
1. D2-LoRA represents a novel synthesis of ideas. It intelligently combines an expressive signed residual, which allows the model not only to learn new features but also to explicitly suppress pre-existing ones, with a stability-enhancing directional constraint.
2. The authors provide detailed ablation studies that methodically dissect the architecture's performance.
3. The evaluation is conducted with commendable rigor under a strict and realistic budget: a maximum of 5,000 training samples per task a
1. While the focus on the low-data regime is well-motivated, the experiments are confined to QA/RC tasks. This is a reasonable scope, but the authors themselves note in Section 9 that "Broader modalities and RLHF-style pipelines are left for future work."
2. The sensitivity of the τ hyperparameter, which balances the positive and negative branches, appears to be model-dependent. Table 7 shows that the optimal value is 1.0 for Llama-3.2-3B-Instruct but 0.5 for Qwen2.5-7B-
- The method is straightforward to implement (it combines a signed low-rank residual with a training-time directional projection) and remains mergeable at inference, imposing minimal engineering overhead within standard PEFT pipelines.
- In low-data, small-rank settings, it consistently improves accuracy over baseline LoRA and stays competitive with related variants across multiple QA/RC tasks and backbones under tight training budgets, with ablations indicating robustness to rank and module choices
- Parameter-count fairness is not fully addressed: comparisons primarily match D²-LoRA at rank r against LoRA at the same r, without a parameter-matched LoRA (e.g., 2r) or equivalent-capacity baselines to isolate architectural benefits from increased parameterization.
- The evaluation scope is limited to multiple-choice QA/RC benchmarks and does not assess open-ended generation, instruction following, code/math reasoning, or multilingual settings, which may exhibit different behaviors and trade-
1. The paper presents a well-motivated synthesis of concepts from LoRA and DoRA. The introduction of a signed residual to provide subtractive capacity is a logical extension, and the use of a directional projection only during training is a clever mechanism to gain stability benefits without sacrificing inference-time mergeability and efficiency.
2. The work is explicitly framed around practical, budget-constrained fine-tuning, using a small number of training samples and epochs. The validation
1. The negative branch is modulated by a scalar α̃, which is a critical new hyperparameter. The main experiments fix α̃=0.5, but the ablation study (Table 7) shows that α̃=1.0 yields better results on one of the backbones. This suggests performance is sensitive to this choice.
2. All eight evaluation benchmarks are question-answering or reading comprehension tasks that can be framed as multiple-choice classification. The method's effectiveness on more open-ended, generative tasks (e.g., summari
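The parameter-count question raised in the reviews can be made concrete with a short calculation. The sketch below assumes each of the two signed branches carries its own pair of rank-r factors, which is consistent with the abstract's matched comparison of LoRA rank 2r against D2-LoRA rank r. The function names and the 4096 by 4096 layer size are hypothetical, chosen only for illustration.

```python
def lora_params(d_out, d_in, rank):
    """Trainable parameters of one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

def two_branch_params(d_out, d_in, rank):
    """Assumed count for a signed adapter with separate additive and subtractive
    branches, each holding its own rank-`rank` factor pair."""
    return 2 * lora_params(d_out, d_in, rank)

d_out, d_in, r = 4096, 4096, 8   # hypothetical layer shape and rank

same_rank_lora = lora_params(d_out, d_in, r)
matched_lora = lora_params(d_out, d_in, 2 * r)
signed_adapter = two_branch_params(d_out, d_in, r)
```

Under this assumption the signed adapter at rank r has twice the trainable parameters of LoRA at rank r and exactly as many as LoRA at rank 2r, which is why the rank-2r baseline is the parameter-matched comparison.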
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
