Effective faking of verbal deception detection with target-aligned adversarial attacks
Bennett Kleinberg, Riccardo Loconte, Bruno Verschuere

TL;DR
This paper demonstrates that adversarial attacks using large language models can effectively fool both human judges and machine learning deception detectors, especially when the attack is tailored to the target's evaluation method.
Contribution
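The paper introduces an adversarial attack in which a large language model rewrites deceptive statements to appear truthful, aimed at both human judges and machine learning detectors, and shows that the attack's effectiveness depends on how well it is aligned with the target.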
Findings
Adversarial modifications reduce detection accuracy to chance when aligned with the target.
Misaligned attacks are less effective, with detection remaining significantly above chance.
Language models can easily be used to generate deceptive statements that fool detection systems.
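The findings above contrast accuracy "at chance" with accuracy "significantly above chance". As a minimal sketch of what such a comparison can look like, the snippet below runs an exact two-sided binomial test against a 50% chance level. The function name `binom_two_sided_p` and all counts are hypothetical illustrations; they are not taken from the paper, and this is not necessarily the analysis the authors used.

```python
from math import comb


def binom_two_sided_p(correct: int, total: int, chance: float = 0.5) -> float:
    """Exact two-sided binomial p-value for `correct` hits out of `total`
    judgments under the null hypothesis that accuracy equals `chance`."""
    # Probability of every possible number of correct judgments under the null.
    probs = [
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(total + 1)
    ]
    observed = probs[correct]
    # Sum the probabilities of all outcomes at least as unlikely as the
    # observed one (small tolerance guards against floating-point ties).
    p_value = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Hypothetical counts, for illustration only (not results from the paper).
near_chance = binom_two_sided_p(correct=52, total=100)
above_chance = binom_two_sided_p(correct=80, total=100)
print(f"52/100 correct: p = {near_chance:.3f}")
print(f"80/100 correct: p = {above_chance:.2e}")
```

With these made-up counts, 52 of 100 correct judgments is statistically indistinguishable from guessing, while 80 of 100 is clearly above chance.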
Abstract
Background: Deception detection through analysing language is a promising avenue using both human judgments and automated machine learning judgments. For both forms of credibility assessment, automated adversarial attacks that rewrite deceptive statements to appear truthful pose a serious threat. Methods: We used a dataset of 243 truthful and 262 fabricated autobiographical stories in a deception detection task for humans and machine learning models. A large language model was tasked to rewrite deceptive statements so that they appear truthful. In Study 1, humans who made a deception judgment or used the detailedness heuristic and two machine learning models (a fine-tuned language model and a simple n-gram model) judged original or adversarial modifications of deceptive statements. In Study 2, we manipulated the target alignment of the modifications, i.e. tailoring the attack to whether…
Taxonomy
Topics: Deception detection and forensic psychology · Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection
