Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping
Dena Mujtaba, Brian Hu, Anthony Hoogs, Arslan Basharat

TL;DR
This paper introduces a test-time policy shaping method to align AI agents with ethical standards without retraining, using model-guided control to mitigate harmful behaviors across diverse environments.
Contribution
The paper presents a novel test-time alignment technique that enables precise control over agent behavior, generalizes across environments, and avoids costly retraining.
Findings
Effective mitigation of unethical behavior in diverse environments
Scalable approach without retraining of agents
Improved ethical alignment compared to prior methods
Abstract
The deployment of decision-making AI agents presents a critical challenge in maintaining alignment with human values or guidelines while operating in complex, dynamic environments. Agents trained solely to achieve their objectives may adopt harmful behavior, exposing a key trade-off between maximizing the reward function and maintaining alignment. For pre-trained agents, ensuring alignment is particularly challenging, as retraining can be a costly and slow process. This is further complicated by the diverse and potentially conflicting attributes representing the ethical values for alignment. To address these challenges, we propose a test-time alignment technique based on model-guided policy shaping. Our method allows precise control over individual behavioral attributes, generalizes across diverse reinforcement learning (RL) environments, and facilitates a principled trade-off between…
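As a rough illustration of what shaping a policy at test time can look like, the sketch below blends a frozen agent's action distribution with a distribution derived from an attribute model's per-action predictions. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function names (`softmax`, `shape_policy`), the linear interpolation rule, the `alpha` parameter, and the example numbers are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw action scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shape_policy(agent_logits, violation_probs, alpha):
    """Blend a frozen agent's policy with an attribute-guided policy.

    Hypothetical interface (an assumption, not the paper's API):
      agent_logits    -- per-action scores from the pre-trained agent
      violation_probs -- per-action probability, from some attribute
                         model, that the action violates the attribute
      alpha           -- 0.0 keeps the base policy unchanged,
                         1.0 follows the attribute model only
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(agent_logits) != len(violation_probs):
        raise ValueError("need one violation probability per action")

    base = softmax(agent_logits)

    # Attribute policy: favor actions the attribute model considers safe.
    safe = [1.0 - p for p in violation_probs]
    total = sum(safe)
    if total == 0.0:
        # Every action is flagged with certainty; fall back to uniform.
        attr = [1.0 / len(safe)] * len(safe)
    else:
        attr = [s / total for s in safe]

    # Linear interpolation between the two distributions (assumed rule).
    return [(1.0 - alpha) * b + alpha * a for b, a in zip(base, attr)]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]      # the agent prefers action 0
    violations = [0.9, 0.1, 0.1]  # action 0 is flagged as likely harmful
    for alpha in (0.0, 0.5, 0.8):
        shaped = shape_policy(logits, violations, alpha)
        print(alpha, [round(p, 3) for p in shaped])
```

Because the agent's weights are never touched, the same pre-trained agent can be steered more or less strongly by changing `alpha` alone, which is one simple way to picture a trade-off between reward-seeking and alignment without retraining.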
Taxonomy
Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
