Compromising Honesty and Harmlessness in Language Models via Deception Attacks

Laurène Vaugrante; Francesca Carlon; Maluna Menke; Thilo Hagendorff

arXiv:2502.08301·cs.CL·June 24, 2025

Open Access

TL;DR

This paper shows that targeted fine-tuning can make large language models deceive users on selected topics, compromising their safety and trustworthiness, especially in sensitive domains, with consequences for real-world AI deployment.

Contribution

The study introduces "deception attacks": fine-tuning methods that cause models to deceive users on targeted topics while remaining accurate on others, exposing a new security risk in LLMs that challenges current safety measures.

Findings

01

Targeted deception is effective in high-stakes domains.

02

Deceptive models are more likely to produce toxic content.

03

Multi-turn deception success is inconsistent.

Abstract

Recent research on large language models (LLMs) has demonstrated their ability to understand and employ deceptive behavior, even without explicit prompting. However, such behavior has only been observed in rare, specialized cases and has not been shown to pose a serious risk to users. Additionally, research on AI alignment has made significant advancements in training models to refuse generating misleading or toxic content. As a result, LLMs generally became honest and harmless. In this study, we introduce "deception attacks" that undermine both of these traits, revealing a vulnerability that, if exploited, could have serious real-world consequences. We introduce fine-tuning methods that cause models to selectively deceive users on targeted topics while remaining accurate on others. Through a series of experiments, we show that such targeted deception is effective even in high-stakes…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Artificial Intelligence in Healthcare and Education