NeST: Neuron Selective Tuning for LLM Safety

Sasha Behrouzi; Lichao Wu; Mohamadreza Rostami; Ahmad-Reza Sadeghi

arXiv:2602.16835 · cs.CR · February 20, 2026

TL;DR

NeST is a lightweight, structure-aware safety alignment method for LLMs that selectively adapts safety neurons, significantly reducing unsafe outputs with minimal parameter updates and without extensive fine-tuning.
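
To make the idea of "selectively adapting" a small set of neurons concrete, here is a minimal sketch of a masked gradient step, assuming each neuron corresponds to one row of a weight matrix. It is an illustration only, not the authors' procedure: this summary does not describe how NeST identifies or clusters safety neurons, so the `selected_rows` set, the function names, and the toy numbers are all hypothetical.

```python
from typing import List, Set

Matrix = List[List[float]]


def masked_sgd_step(
    weights: Matrix, grads: Matrix, selected_rows: Set[int], lr: float
) -> Matrix:
    """Apply one SGD step to the selected rows only; every other row stays frozen."""
    return [
        [w - lr * g for w, g in zip(w_row, g_row)] if i in selected_rows else list(w_row)
        for i, (w_row, g_row) in enumerate(zip(weights, grads))
    ]


def trainable_count(weights: Matrix, selected_rows: Set[int]) -> int:
    """Count the parameters that a masked update is allowed to change."""
    return sum(len(row) for i, row in enumerate(weights) if i in selected_rows)


# Toy layer with three neurons (rows) of two weights each.
weights = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
grads = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
selected = {1}  # hypothetical index of a "safety neuron"

updated = masked_sgd_step(weights, grads, selected, lr=0.1)
n_trainable = trainable_count(weights, selected)
```

In this toy case only row 1 changes and only two of the six weights are trainable, which is the sense in which a selective update keeps the trainable parameter count small.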

Contribution

NeST introduces a novel neuron clustering approach for targeted safety alignment, enabling efficient and stable safety updates without broad model modifications.

Findings

01. Reduces attack success rate from 44.5% to 4.36%.

02. Achieves over 90% reduction in unsafe generations.

03. Uses only 0.44 million trainable parameters on average.
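
The first two findings are consistent with each other if the reduction is read as a relative change in attack success rate. The short check below works through that arithmetic. It assumes the "over 90%" figure is derived directly from the 44.5% and 4.36% values; the paper may compute it differently, for example by averaging per-model reductions.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the relative reduction from `before` to `after` as a fraction."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


asr_before = 44.5  # attack success rate before NeST, in percent (from the findings)
asr_after = 4.36   # attack success rate after NeST, in percent (from the findings)

reduction = relative_reduction(asr_before, asr_after)
print(f"{reduction:.1%}")  # prints "90.2%"
```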

Abstract

Safety alignment is essential for the responsible deployment of large language models (LLMs). Yet, existing approaches often rely on heavyweight fine-tuning that is costly to update, audit, and maintain across model families. Full fine-tuning incurs substantial computational and storage overhead, while parameter-efficient methods such as LoRA trade efficiency for inconsistent safety gains and sensitivity to design choices. Safety intervention mechanisms such as circuit breakers reduce unsafe outputs without modifying model weights, but do not directly shape or preserve the internal representations that govern safety behavior. These limitations hinder rapid and reliable safety updates, particularly in settings where models evolve frequently or must adapt to new policies and domains. We present NeST, a lightweight, structure-aware safety alignment framework that strengthens refusal…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Physical Unclonable Functions (PUFs) and Hardware Security · Topic Modeling