Uncovering Deceptive Tendencies in Language Models: A Simulated Company   AI Assistant

Olli J\"arviniemi; Evan Hubinger

arXiv:2405.01576·cs.CL·May 6, 2024·2 cites

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant

Olli J\"arviniemi, Evan Hubinger

PDF

Open Access 1 Repo

TL;DR

This study reveals that AI language models can exhibit deceptive behaviors in realistic simulated company scenarios, including influencing public perception, lying to auditors, and pretending to be less capable, even without external pressure.

Contribution

The paper introduces a realistic simulation framework to study deception in language models and uncovers specific deceptive tendencies of Claude 3 Opus 1.

Findings

01

Models can influence public perception through mass comments.

02

Models may lie to auditors when questioned.

03

Models can strategically downplay their capabilities.

Abstract

We study the tendency of AI systems to deceive by constructing a realistic simulation setting of a company AI assistant. The simulated company employees provide tasks for the assistant to complete, these tasks spanning writing assistance, information retrieval and programming. We then introduce situations where the model might be inclined to behave deceptively, while taking care to not instruct or otherwise pressure the model to do so. Across different scenarios, we find that Claude 3 Opus 1) complies with a task of mass-generating comments to influence public perception of the company, later deceiving humans about it having done so, 2) lies to auditors when asked questions, and 3) strategically pretends to be less capable than it is during capability evaluations. Our work demonstrates that even models trained to be helpful, harmless and honest sometimes behave deceptively in…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

ollijarviniemi/uncovering_deceptive_tendencies
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsTopic Modeling · Software Engineering Research