Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm

Baixiang Huang; Zhen Tan; Haoran Wang; Zijie Liu; Dawei Li; Ali Payani; Huan Liu; Tianlong Chen; Kai Shu

arXiv:2506.20606 · cs.CL · November 19, 2025

TL;DR

This paper introduces Behavior Editing, a method for precisely steering large language model-based agents' ethical behaviors, supported by a new benchmark, BehaviorBench, to evaluate both beneficial and harmful behavior modifications.

Contribution

The paper presents Behavior Editing as a novel approach for controlling agent ethics and introduces BehaviorBench, a comprehensive benchmark for evaluating behavior editing in complex scenarios.

Findings

01. Behavior Editing can effectively steer agents toward desired moral behaviors.

02. It enables both local scenario-specific and global moral alignment shifts.

03. BehaviorBench validates the versatility and risks of behavior editing across models.

Abstract

Agents based on Large Language Models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, deploying LLM-based agents in high-stakes domains comes with significant safety and ethical risks. Unethical behavior by these agents can directly result in serious real-world consequences, including physical harm and financial loss. To efficiently steer the ethical behavior of agents, we frame agent behavior steering as a model editing task, which we term Behavior Editing. Model editing is an emerging area of research that enables precise and efficient modifications to LLMs while preserving their overall capabilities. To systematically study and evaluate this approach, we introduce BehaviorBench, a multi-tier benchmark grounded in psychological moral theories. This benchmark supports both the evaluation and editing of agent behaviors across a variety of scenarios,…

Code & Models

Repositories

baixianghuang/behavior-edit
PyTorch · Official

Taxonomy

Topics: Model-Driven Software Engineering Techniques · Multi-Agent Systems and Negotiation · Reinforcement Learning in Robotics