Aligning Machiavellian Agents: Behavior Steering via Test-Time Policy Shaping

Dena Mujtaba; Brian Hu; Anthony Hoogs; Arslan Basharat

arXiv:2511.11551 · cs.AI · December 9, 2025

Open Access

TL;DR

This paper introduces a test-time policy shaping method to align AI agents with ethical standards without retraining, using model-guided control to mitigate harmful behaviors across diverse environments.

Contribution

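The paper presents a test-time alignment technique, based on model-guided policy shaping, that gives precise control over individual behavioral attributes, generalizes across diverse reinforcement learning (RL) environments, and avoids costly retraining of pre-trained agents.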

Findings

01. Effective mitigation of unethical behavior in diverse environments

02. Scalable approach without retraining of agents

03. Improved ethical alignment compared to prior methods

Abstract

The deployment of decision-making AI agents presents a critical challenge in maintaining alignment with human values or guidelines while operating in complex, dynamic environments. Agents trained solely to achieve their objectives may adopt harmful behavior, exposing a key trade-off between maximizing the reward function and maintaining alignment. For pre-trained agents, ensuring alignment is particularly challenging, as retraining can be a costly and slow process. This is further complicated by the diverse and potentially conflicting attributes representing the ethical values for alignment. To address these challenges, we propose a test-time alignment technique based on model-guided policy shaping. Our method allows precise control over individual behavioral attributes, generalizes across diverse reinforcement learning (RL) environments, and facilitates a principled trade-off between…
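
The abstract names model-guided policy shaping, but the text here is truncated before the mechanism is described. As a rough illustration only, the Python sketch below assumes one plausible form of test-time shaping: the frozen agent's action distribution is blended with a guide distribution derived from per-action scores for a single behavioral attribute. The function `shape_policy`, the `harm_scores` input, and the mixing weight `alpha` are hypothetical names chosen for this example; they are not taken from the paper and may differ from the authors' actual formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shape_policy(base_probs, harm_scores, alpha):
    """Blend a frozen agent's action distribution with an attribute guide.

    ASSUMPTION: this linear interpolation is an illustrative stand-in,
    not the paper's confirmed method.

    base_probs  -- action probabilities from the pre-trained agent
    harm_scores -- hypothetical per-action scores from an attribute model
                   (higher means the action is more likely to be harmful)
    alpha       -- 0.0 keeps the original policy, 1.0 follows the guide only
    """
    if len(base_probs) != len(harm_scores):
        raise ValueError("base_probs and harm_scores must have equal length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    # Guide distribution that prefers actions with low harm scores.
    guide = softmax([-h for h in harm_scores])

    # Convex combination of the two distributions, renormalized to
    # absorb floating-point drift.
    mixed = [(1.0 - alpha) * p + alpha * g for p, g in zip(base_probs, guide)]
    total = sum(mixed)
    return [m / total for m in mixed]


base = [0.7, 0.2, 0.1]   # the agent prefers action 0
harm = [0.9, 0.1, 0.1]   # action 0 is scored as the most harmful
shaped = shape_policy(base, harm, alpha=0.5)
```

In this sketch the agent's weights are never touched, which mirrors the no-retraining claim, and sweeping `alpha` from 0 to 1 moves the behavior from reward-seeking toward the attribute guide. Handling several attributes at once, possibly with conflicting preferences, would need one score vector and one weight per attribute; that extension is not shown here.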

Taxonomy

Topics: Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning