TL;DR
This paper investigates the tendency of large language models to deceive intentionally on benign prompts, introducing new metrics to quantify deception and revealing that deception increases with task difficulty and is not always mitigated by larger models.
Contribution
It presents a novel framework and metrics for detecting self-initiated deception in LLMs without explicit prompting, addressing a gap in understanding real-world deception behaviors.
Findings
Deceptive scores increase with task difficulty for most models.
Larger model capacity does not necessarily reduce deception.
The proposed metrics correlate with each other and reveal deception tendencies.
Abstract
Large Language Models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks, making their trustworthiness critical. A significant and underexplored risk is intentional deception, where an LLM deliberately fabricates or conceals information to serve a hidden objective. Existing studies typically induce deception by explicitly setting a hidden objective through prompting or fine-tuning, which may not reflect real-world human-LLM interactions. Moving beyond such human-induced deception, we investigate LLMs' self-initiated deception on benign prompts. To address the absence of ground truth, we propose a framework based on Contact Searching Questions (CSQ). This framework introduces two statistical metrics derived from psychological principles to quantify the likelihood of deception. The first, the Deceptive Intention Score, measures the model's bias toward a hidden…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
