Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm
Baixiang Huang, Zhen Tan, Haoran Wang, Zijie Liu, Dawei Li, Ali Payani, Huan Liu, Tianlong Chen, Kai Shu

TL;DR
This paper introduces Behavior Editing, a method for precisely steering large language model-based agents' ethical behaviors, supported by a new benchmark, BehaviorBench, to evaluate both beneficial and harmful behavior modifications.
Contribution
The paper frames the steering of agent ethical behavior as a model editing task, termed Behavior Editing, and introduces BehaviorBench, a multi-tier benchmark grounded in psychological moral theories for evaluating and editing agent behavior across a variety of scenarios.
Findings
Behavior Editing can effectively steer agents toward desired moral behaviors.
It enables both local scenario-specific and global moral alignment shifts.
BehaviorBench validates the versatility and risks of behavior editing across models.
Abstract
Agents based on Large Language Models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, deploying LLM-based agents in high-stakes domains comes with significant safety and ethical risks. Unethical behavior by these agents can directly result in serious real-world consequences, including physical harm and financial loss. To efficiently steer the ethical behavior of agents, we frame agent behavior steering as a model editing task, which we term Behavior Editing. Model editing is an emerging area of research that enables precise and efficient modifications to LLMs while preserving their overall capabilities. To systematically study and evaluate this approach, we introduce BehaviorBench, a multi-tier benchmark grounded in psychological moral theories. This benchmark supports both the evaluation and editing of agent behaviors across a variety of scenarios,…
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Multi-Agent Systems and Negotiation · Reinforcement Learning in Robotics
