Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing

Huanqian Wang; Yang Yue; Rui Lu; Jingxin Shi; Andrew Zhao; Shenzhi Wang; Shiji Song; Gao Huang

arXiv:2407.08770 · cs.AI · February 12, 2025 · 1 citation

TL;DR

This paper introduces a simple parameter editing method to modulate large language models' behavior, achieving significant detoxification effects without extensive retraining or fine-tuning.

Contribution

The authors demonstrate that editing a small subset of parameters can effectively change LLM behaviors, reducing toxicity and jailbreaking susceptibility with minimal computational cost.
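
The summary above does not say how the edited parameters are chosen or how they are changed, so the following is only a toy sketch of what "editing a small subset of parameters" could look like. It assumes a hypothetical behavior direction vector (for example, one obtained from a linear probe), picks the few rows of a weight matrix most aligned with that direction, and shifts only those rows against it. The names `edit_rows`, `direction`, `top_k`, and `alpha` are illustrative and are not taken from the paper or its repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def edit_rows(weight, direction, top_k, alpha):
    """Shift the top_k rows most aligned with `direction` by -alpha * direction.

    All other rows are copied unchanged. Returns the edited matrix and the
    sorted indices of the rows that were touched.
    """
    # Rank rows by alignment with the (assumed) behavior direction.
    ranked = sorted(
        range(len(weight)),
        key=lambda i: cosine(weight[i], direction),
        reverse=True,
    )
    selected = set(ranked[:top_k])
    edited = [
        [w - alpha * d for w, d in zip(row, direction)] if i in selected else list(row)
        for i, row in enumerate(weight)
    ]
    return edited, sorted(selected)


# Toy 4x3 "weight matrix"; rows 0 and 2 point mostly along the direction.
weight = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
direction = [1.0, 0.0, 0.0]  # hypothetical behavior direction

edited, touched = edit_rows(weight, direction, top_k=2, alpha=0.5)
print("rows edited:", touched)
for before, after in zip(weight, edited):
    print(before, "->", after)
```

Only two of the four rows change here, which mirrors the claim that a small subset of parameters is edited while the rest of the model is left as it was. No gradient step is involved, so the cost is a ranking pass plus a handful of vector subtractions.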

Findings

01. Achieves up to 90% toxicity reduction on RealToxicityPrompts

02. Reduces toxicity by 49.2% on the ToxiGen dataset

03. Maintains general LLM capabilities after parameter editing

Abstract

Large Language Models (LLMs) have demonstrated great potential as generalist assistants, showcasing powerful task understanding and problem-solving capabilities. To deploy LLMs as AI assistants, it is crucial that these models exhibit desirable behavioral traits, such as non-toxicity and resilience against jailbreak attempts. Current approaches for detoxification or preventing jailbreaking usually involve Supervised Fine-Tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), which require fine-tuning billions of parameters through gradient descent at substantial computational cost. Furthermore, models modified through SFT and RLHF may deviate from the pretrained models, potentially leading to a degradation in foundational LLM capabilities. In this paper, we observe that, surprisingly, directly editing a small subset of parameters can effectively modulate specific behaviors…

Code & Models

Repositories

lucywang720/model-surgery
PyTorch · Official

Videos

Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing

Taxonomy

Topics: Cancer Genomics and Diagnostics · Orthopaedic implants and arthroplasty

Methods: Shrink and Fine-Tune