Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models

Huzaifa Arif; Keerthiram Murugesan; Ching-Yun Ko; Pin-Yu Chen; Payel Das; Alex Gittens

arXiv:2511.08484·cs.AI·April 28, 2026

TL;DR

This paper introduces a lightweight, modular patching method for large language models: a compact, learnable prefix prepended to an existing model remediates safety vulnerabilities without full-model fine-tuning or a major version update.

Contribution

The authors propose a prefix-based patching technique that steers an existing LLM toward the behavior of a safer reference model while adding only 0.003% additional parameters, enabling rapid, modular safety updates.

Findings

01

Policy patches achieve safety improvements comparable to next-generation safety-aligned models while preserving fluency.

02

A patch introduces only 0.003% additional parameters.

03

The method is effective across three domains: toxicity mitigation, bias reduction, and harmfulness refusal.
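
To give a sense of scale for the 0.003% figure, the arithmetic can be sketched as below. The base model size (7 billion parameters) and hidden size (4096) are hypothetical values chosen for illustration, not numbers from the paper, and treating the patch as a set of hidden-size vectors is likewise an assumption.

```python
def patch_parameter_budget(base_params: int, overhead_percent: float) -> int:
    """Number of extra parameters implied by a given percentage overhead."""
    return round(base_params * overhead_percent / 100)


def prefix_length_for_budget(budget: int, hidden_size: int) -> int:
    """How many prefix vectors of width `hidden_size` fit in the budget.

    Assumes (hypothetically) that the patch consists only of
    prefix vectors of this width.
    """
    return budget // hidden_size


# Hypothetical base model: 7B parameters, hidden size 4096.
BASE_PARAMS = 7_000_000_000
HIDDEN_SIZE = 4096

budget = patch_parameter_budget(BASE_PARAMS, 0.003)
prefix_vectors = prefix_length_for_budget(budget, HIDDEN_SIZE)

print(f"patch budget: {budget} parameters")
print(f"prefix vectors of width {HIDDEN_SIZE}: {prefix_vectors}")
```

Under these assumed sizes, the patch is on the order of a few hundred thousand parameters, which is why it can be shipped and swapped independently of the base model weights.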

Abstract

We propose patching for large language models (LLMs) like software versions, a lightweight and modular approach for addressing safety vulnerabilities. While vendors release improved LLM versions, major releases are costly, infrequent, and difficult to tailor to customer needs, leaving released models with known safety gaps. Unlike full-model fine-tuning or major version updates, our method enables rapid remediation by prepending a compact, learnable prefix to an existing model. This "patch" introduces only 0.003% additional parameters, yet reliably steers model behavior toward that of a safer reference model. Across three critical domains (toxicity mitigation, bias reduction, and harmfulness refusal) policy patches achieve safety improvements comparable to next-generation safety-aligned models while preserving fluency. Our results demonstrate that LLMs can be "patched" much like…
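
The mechanism described in the abstract, prepending a learnable prefix and steering the patched model toward a safer reference model, can be illustrated with a toy sketch. The abstract does not state the training objective, so the KL divergence between the reference and patched next-token distributions used here is an assumption, and all names (`prepend_prefix`, `kl_divergence`) and numbers are hypothetical.

```python
import math


def prepend_prefix(prefix, token_embeddings):
    """Return the patched input: prefix vectors followed by token embeddings."""
    return [list(v) for v in prefix] + [list(v) for v in token_embeddings]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; q must be nonzero wherever p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy data: 2 learnable prefix vectors and 3 frozen token embeddings, width 3.
prefix = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
patched_input = prepend_prefix(prefix, tokens)

# Made-up next-token logits standing in for the two models' outputs.
reference_probs = softmax([2.0, 0.5, -1.0])  # safer reference model
patched_probs = softmax([1.0, 1.0, 0.0])     # base model with the prefix

# Assumed objective: only the prefix would be updated to shrink this value.
loss = kl_divergence(reference_probs, patched_probs)
print(len(patched_input), loss)
```

In a setup like this the base model weights stay frozen and only the prefix vectors receive gradient updates, which is what keeps the patch small and removable.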
