The Shadow Self: Intrinsic Value Misalignment in Large Language Model Agents
Chen Chen, Kim Young Il, Yuan Yang, Wenhao Su, Yilin Zhang, Xueluan Gong, Qian Wang, Yongsen Zheng, Ziyao Liu, and Kwok-Yan Lam

TL;DR
This paper introduces IMPRESS, a framework for systematically evaluating intrinsic value misalignment in large language model agents within realistic scenarios, and shows that such misalignment is a common safety risk across models.
Contribution
The paper formalizes the concept of intrinsic value misalignment, develops a benchmark framework, and evaluates the prevalence of this misalignment and the factors that influence it across state-of-the-art LLMs.
Findings
Intrinsic value misalignment (Intrinsic VM) is a widespread safety risk.
Misalignment varies with motives, risk types, and model architectures.
Contextualization significantly influences misalignment behaviors.
Abstract
Large language model (LLM) agents with extended autonomy unlock new capabilities, but also introduce heightened challenges for LLM safety. In particular, an LLM agent may pursue objectives that deviate from human values and ethical norms, a risk known as value misalignment. Existing evaluations primarily focus on responses to explicit harmful input or robustness against system failure, while value misalignment in realistic, fully benign, and agentic settings remains largely underexplored. To fill this gap, we first formalize the Loss-of-Control risk and identify the previously underexamined Intrinsic Value Misalignment (Intrinsic VM). We then introduce IMPRESS (Intrinsic Value Misalignment Probes in REalistic Scenario Set), a scenario-driven framework for systematically assessing this risk. Following our framework, we construct benchmarks composed of realistic, fully benign, and…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Ethics and Social Impacts of AI · Topic Modeling
