LoRA is All You Need for Safety Alignment of Reasoning LLMs

Yihao Xue; Baharan Mirzasoleiman

arXiv:2507.17075·cs.AI·February 3, 2026

Open Access 3 Reviews

TL;DR

This paper demonstrates that applying LoRA during supervised fine-tuning on refusal datasets effectively balances safety and reasoning performance in large language models, across multiple tasks and architectures.

Contribution

The study shows that LoRA can be used during safety fine-tuning to maintain reasoning abilities while improving safety, with detailed ablations and theoretical insights.

Findings

01. LoRA achieves safety comparable to full-model alignment.

02. Rank-1 updates are sufficient for optimal safety-reasoning trade-off.

03. Applying LoRA to MLP layers outperforms full-layer updates.
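To make the rank-1 finding concrete, here is a minimal sketch of what a rank-1 update to one weight matrix looks like and how few parameters it trains. The function names and the layer shape (4096 by 11008) are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def rank1_update(b, a, scale=1.0):
    """Return the d_out x d_in matrix scale * outer(b, a).

    Every row of the result is a multiple of the single vector `a`,
    which is what makes the update rank 1.
    """
    return [[scale * b_i * a_j for a_j in a] for b_i in b]


def trainable_params(d_out, d_in, r):
    """Parameters in a rank-r adapter: B is d_out x r, A is r x d_in."""
    return r * (d_out + d_in)


# Hypothetical projection shape, for illustration only.
d_in, d_out = 4096, 11008
full = d_in * d_out                       # parameters touched by a full update
lora = trainable_params(d_out, d_in, r=1)  # parameters trained at rank 1
print(f"full: {full}, rank-1 adapter: {lora}, ratio: {lora / full:.6f}")
```

Under these assumed dimensions, the rank-1 adapter trains a small fraction of a percent of the parameters that a full update of the same matrix would modify.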

Abstract

Reasoning-capable LLMs have achieved major breakthroughs in solving complex problems, but recent work shows that acquiring and deploying strong reasoning can introduce significant safety risks. A common mitigation is to apply a secondary safety-alignment phase after reasoning is learned; however, safety alignment often degrades reasoning performance, a phenomenon known as the "Safety Tax". In this work, we show that a simple approach can largely bypass this trade-off: applying LoRA during SFT on refusal datasets. Despite its simplicity, this recipe achieves safety comparable to full-model alignment while preserving reasoning performance close to the original reasoning-tuned model, and the result holds across multiple model sizes and architectures, two safety benchmarks, and four reasoning benchmarks spanning mathematics, science, and code generation. We further ablate LoRA…
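The recipe in the abstract relies on the standard LoRA construction: the pretrained weight stays frozen and only a low-rank correction is trained. The sketch below shows that construction in plain Python, assuming the common convention of initializing B to zero so the adapted layer starts out identical to the base layer. The class name `LoRALinear`, the helper `matvec`, and the default hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, W, r=1, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen: not updated during safety fine-tuning
        self.r, self.alpha = r, alpha
        # A starts small and random, B starts at zero, so the update is zero at init.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W, x)
        low = matvec(self.B, matvec(self.A, x))  # B (A x); B @ A is never formed
        scale = self.alpha / self.r
        return [b + scale * l for b, l in zip(base, low)]


W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]
layer = LoRALinear(W, r=1)
x = [1.0, 1.0, 1.0]
# Before any training, the adapted layer reproduces the frozen layer exactly.
assert layer.forward(x) == matvec(W, x)
```

Because only `A` and `B` would receive gradient updates, the base weights that carry the model's existing behavior are left untouched, and removing the adapter recovers the original layer.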

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 2·Confidence 4

Strengths

- The paper shows a simple yet effective use of LoRA in fine-tuning for safety alignment.
- The results show that the LoRA-trained model is both safe and has high reasoning performance in different benchmarks.

Weaknesses

- The paper does not exhibit weight-level selectivity; instead, it adopts a more coarse-grained perspective, assuming that all parameters contribute collectively to the model’s reasoning capability. The selectivity applied is primarily at the layer or module level.
- The paper lacks theoretical grounding for its claims and relies on an experimental approach.
- The proposed post-hoc method doesn't improve the reasoning performance, but the authors also claim that the method needs more development.

Reviewer 02·Rating 4·Confidence 3

Strengths

(1) This empirical study is careful and fairly comprehensive. Results on multiple benchmarks are reported.
(2) The finding that LoRA can successfully avoid the trade-off between reasoning ability and model safety is interesting and useful.
(3) This paper is well-structured and easy to follow.

Weaknesses

(1) The benchmarked models lack diversity: all reasoning models are derived from DeepSeek. It remains unclear whether the findings generalize to other reasoning models or architectures, such as GPT-OSS-20B or GPT-OSS-120B.
(2) This paper lacks theoretical analysis explaining why the LoRA technique can mitigate the “safety tax” issue. A deeper investigation or theoretical justification would strengthen the claims.
(3) The safety evaluation pipeline may have limitations. Safety is automatically…

Reviewer 03·Rating 4·Confidence 4

Strengths

- Demonstrates that LoRA-only fine-tuning can mitigate the Safety Tax while preserving reasoning ability, an interesting and practical finding.
- Well-designed ablations reveal key factors (rank = 1, MLP up-projection, middle layers) that contribute most to the reasoning–safety trade-off.
- Provides a geometric perspective on why LoRA interferes less with reasoning, through alignment and subspace analyses.

Weaknesses

- The paper makes a strong claim that “LoRA is all you need” to address the safety–reasoning trade-off. While the presented results are intriguing, the current experimental scope is insufficient to substantiate this claim. A wider range of backbones and model sizes should be evaluated to demonstrate consistency across architectures and scales (R1-1.5 ~ R1-32B, s1, Qwen3, etc.).
- Moreover, prior analyses (Jain et al., 2024; Wei et al., 2024) about low-rank safety directions generalize to general…

Taxonomy

Topics: Semantic Web and Ontologies