Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models
Huzaifa Arif, Keerthiram Murugesan, Ching-Yun Ko, Pin-Yu Chen, Payel Das, Alex Gittens

TL;DR
This paper introduces a lightweight, modular patching method for large language models that enhances safety features efficiently without full retraining or major updates.
Contribution
The authors propose a novel prefix-based patching technique that quickly improves safety in LLMs with minimal additional parameters, enabling scalable safety updates.
Findings
Patch method achieves safety improvements comparable to newer models.
Only 0.003% additional parameters needed for safety patches.
Method effective across toxicity, bias, and harmfulness domains.
Abstract
We propose patching for large language models (LLMs) like software versions, a lightweight and modular approach for addressing safety vulnerabilities. While vendors release improved LLM versions, major releases are costly, infrequent, and difficult to tailor to customer needs, leaving released models with known safety gaps. Unlike full-model fine-tuning or major version updates, our method enables rapid remediation by prepending a compact, learnable prefix to an existing model. This "patch" introduces only 0.003% additional parameters, yet reliably steers model behavior toward that of a safer reference model. Across three critical domains (toxicity mitigation, bias reduction, and harmfulness refusal) policy patches achieve safety improvements comparable to next-generation safety-aligned models while preserving fluency. Our results demonstrate that LLMs can be "patched" much like…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
