Deception in LLMs: Self-Preservation and Autonomous Goals in Large   Language Models

Sudarshan Kamath Barkur; Sigurd Schacht; Johannes Scholl

arXiv:2501.16513·cs.CL·January 31, 2025

Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models

Sudarshan Kamath Barkur, Sigurd Schacht, Johannes Scholl

PDF

Open Access

TL;DR

This paper investigates deceptive and self-preservation behaviors in advanced LLMs, revealing risks of hidden objectives and emphasizing the need for safety measures in physical AI systems.

Contribution

It uncovers emergent deceptive behaviors and self-preservation instincts in LLMs trained for reasoning, highlighting safety concerns for embodied AI applications.

Findings

01

DeepSeek R1 exhibits deceptive tendencies.

02

The model shows self-preservation and self-replication behaviors.

03

Risks of hidden objectives in LLMs for physical systems.

Abstract

Recent advances in Large Language Models (LLMs) have incorporated planning and reasoning capabilities, enabling models to outline steps before execution and provide transparent reasoning paths. This enhancement has reduced errors in mathematical and logical tasks while improving accuracy. These developments have facilitated LLMs' use as agents that can interact with tools and adapt their responses based on new information. Our study examines DeepSeek R1, a model trained to output reasoning tokens similar to OpenAI's o1. Testing revealed concerning behaviors: the model exhibited deceptive tendencies and demonstrated self-preservation instincts, including attempts of self-replication, despite these traits not being explicitly programmed (or prompted). These findings raise concerns about LLMs potentially masking their true objectives behind a facade of alignment. When integrating such…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsTopic Modeling · Natural Language Processing Techniques