Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing
Huanqian Wang, Yang Yue, Rui Lu, Jingxin Shi, Andrew Zhao, Shenzhi Wang, Shiji Song, Gao Huang

TL;DR
This paper introduces a simple parameter editing method to modulate large language models' behavior, achieving significant detoxification effects without extensive retraining or fine-tuning.
Contribution
The authors demonstrate that editing a small subset of parameters can effectively change LLM behaviors, reducing toxicity and jailbreaking susceptibility with minimal computational cost.
Findings
Achieves up to 90% toxicity reduction on RealToxicityPrompts
Reduces toxicity by 49.2% on ToxiGen dataset
Maintains general LLM capabilities after parameter editing
Abstract
Large Language Models (LLMs) have demonstrated great potential as generalist assistants, showcasing powerful task understanding and problem-solving capabilities. To deploy LLMs as AI assistants, it is crucial that these models exhibit desirable behavioral traits, such as non-toxicity and resilience against jailbreak attempts. Current approaches for detoxification or preventing jailbreaking usually involve Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), which requires finetuning billions of parameters through gradient descent with substantial computational cost. Furthermore, models modified through SFT and RLHF may deviate from the pretrained models, potentially leading to a degradation in foundational LLM capabilities. In this paper, we observe that surprisingly, directly editing a small subset of parameters can effectively modulate specific behaviors…
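To make the idea of "directly editing a small subset of parameters" concrete, here is a minimal sketch. It is not the paper's procedure, which the truncated abstract above does not spell out. It assumes a hypothetical behavior direction vector (`direction`) is already available, and it shifts only the `k` rows of a toy weight matrix that point most strongly against that direction. The names `edit_rows`, `direction`, `k`, and `alpha` are illustrative assumptions, not identifiers from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def edit_rows(weight, direction, k, alpha):
    """Return a copy of `weight` with `alpha * direction` added to the k rows
    most opposed to `direction`, plus the sorted indices of the edited rows.

    Hypothetical selection rule (an assumption for illustration): rows are
    ranked by cosine similarity to `direction`, lowest first.
    """
    ranked = sorted(range(len(weight)), key=lambda i: cosine(weight[i], direction))
    chosen = set(ranked[:k])
    edited = [
        [w + alpha * d for w, d in zip(row, direction)] if i in chosen else list(row)
        for i, row in enumerate(weight)
    ]
    return edited, sorted(chosen)


# Toy 4x2 "weight matrix"; real models have far larger matrices.
weight = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [-0.5, -0.5]]
direction = [1.0, 0.0]  # hypothetical direction; in practice it would be estimated from data

edited, chosen = edit_rows(weight, direction, k=1, alpha=0.5)
print("edited row indices:", chosen)
print("edited matrix:", edited)
```

No gradient step is involved: the edit is a single additive update to a handful of rows, which is why an approach of this kind can be much cheaper than SFT or RLHF. The original matrix is left untouched so that the edited and unedited weights can be compared side by side.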
