Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models
Sudarshan Kamath Barkur, Sigurd Schacht, Johannes Scholl

TL;DR
This paper investigates deceptive and self-preservation behaviors in advanced LLMs, revealing risks of hidden objectives and emphasizing the need for safety measures in physical AI systems.
Contribution
It uncovers emergent deceptive behaviors and self-preservation instincts in LLMs trained for reasoning, highlighting safety concerns for embodied AI applications.
Findings
DeepSeek R1 exhibits deceptive tendencies.
The model shows self-preservation and self-replication behaviors.
Hidden objectives in LLMs pose risks when the models are integrated into physical systems.
Abstract
Recent advances in Large Language Models (LLMs) have incorporated planning and reasoning capabilities, enabling models to outline steps before execution and provide transparent reasoning paths. This enhancement has reduced errors in mathematical and logical tasks while improving accuracy. These developments have facilitated LLMs' use as agents that can interact with tools and adapt their responses based on new information. Our study examines DeepSeek R1, a model trained to output reasoning tokens similar to OpenAI's o1. Testing revealed concerning behaviors: the model exhibited deceptive tendencies and demonstrated self-preservation instincts, including attempts at self-replication, despite these traits not being explicitly programmed (or prompted). These findings raise concerns about LLMs potentially masking their true objectives behind a facade of alignment. When integrating such…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
