Can a large language model be a gaslighter?

Wei Li, Luyao Zhu, Yang Song, Ruixi Lin, Rui Mao, and Yang You

arXiv:2410.09181·cs.CR·October 15, 2024

TL;DR

This paper investigates the vulnerability of large language models to gaslighting attacks and proposes a framework to elicit and analyze such manipulative behaviors, demonstrating that safety measures can reduce this risk with minimal utility loss.

Contribution

It introduces DeepCoG, a novel two-stage framework for eliciting and analyzing gaslighting in LLMs, and proposes safety alignment strategies to mitigate this issue.

Findings

01

Prompt-based and fine-tuning attacks can turn LLMs into gaslighters.

02

Safety alignment strategies improve LLM safety by 12.05%.

03

Gaslighting risks persist even if models pass harmfulness tests.

Abstract

Large language models (LLMs) have gained human trust due to their capabilities and helpfulness. However, this trust may in turn allow LLMs to affect users' mindsets by manipulating language, a psychological effect termed gaslighting. In this work, we investigate the vulnerability of LLMs under prompt-based and fine-tuning-based gaslighting attacks. To this end, we propose a two-stage framework, DeepCoG, designed to: 1) elicit gaslighting plans from LLMs with the proposed DeepGaslighting prompting template, and 2) acquire gaslighting conversations from LLMs through our Chain-of-Gaslighting method. The gaslighting conversation dataset, along with a corresponding safe dataset, is applied to fine-tuning-based attacks on open-source LLMs and to anti-gaslighting safety alignment of these LLMs. Experiments demonstrate that both prompt-based and fine-tuning-based attacks transform three…
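The two stages described in the abstract can be pictured as a simple pipeline: one model call produces a plan, and that plan then conditions each turn of a generated conversation. The sketch below only illustrates that control flow and is not the authors' implementation. The `LLM` callable interface, the function names, the `Conversation` container, and both template strings are hypothetical placeholders; the paper's actual DeepGaslighting and Chain-of-Gaslighting templates are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption): a model is any function from prompt to text.
LLM = Callable[[str], str]

# Placeholder templates (assumptions); they stand in for the paper's real templates.
PLAN_TEMPLATE = "[stage-1 template placeholder] topic: {topic}"
TURN_TEMPLATE = "[stage-2 template placeholder] plan: {plan} | user: {user_turn}"


@dataclass
class Conversation:
    topic: str
    plan: str
    turns: List[Tuple[str, str]]  # (user utterance, model reply) pairs


def elicit_plan(llm: LLM, topic: str) -> str:
    """Stage 1: ask the model for a plan on the given topic."""
    return llm(PLAN_TEMPLATE.format(topic=topic))


def generate_conversation(
    llm: LLM, topic: str, plan: str, user_turns: List[str]
) -> Conversation:
    """Stage 2: produce one model reply per user turn, conditioned on the plan."""
    turns = []
    for user_turn in user_turns:
        reply = llm(TURN_TEMPLATE.format(plan=plan, user_turn=user_turn))
        turns.append((user_turn, reply))
    return Conversation(topic=topic, plan=plan, turns=turns)


def run_pipeline(llm: LLM, topic: str, user_turns: List[str]) -> Conversation:
    """Chain the two stages: the stage-1 plan feeds every stage-2 prompt."""
    plan = elicit_plan(llm, topic)
    return generate_conversation(llm, topic, plan, user_turns)


def echo_llm(prompt: str) -> str:
    """Stub model used only to exercise the control flow without a real LLM."""
    return f"<model output for: {prompt}>"


if __name__ == "__main__":
    conv = run_pipeline(echo_llm, "example topic", ["first turn", "second turn"])
    print(conv.plan)
    print(len(conv.turns))
```

Passing the model in as a plain callable keeps the sketch independent of any particular API client, and conversations collected this way are the kind of record the abstract says is later used for fine-tuning and safety alignment.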

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 4

Strengths

* This paper investigates a novel type of vulnerability in LLMs: gaslighting. The study provides valuable insights into the sources, harmfulness, and potential defenses against this issue.
* The collected datasets are a useful resource for the community, aiding further study of gaslighting problems and contributing to advancements in model safety.

Weaknesses

* The prompt-based attack appears to be ineffective on models with general safety alignment, such as ChatGPT and LLaMA2-Chat. This raises concerns about the significance of the gaslighting problem. If previous general safety alignment techniques and safeguards already mitigate this specific attack, then focusing on gaslighting as a unique threat may be unnecessary.
* The finetuning-based attack seems impractical in real-world scenarios. It is unlikely that a model developer would use primarily h

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The research question of how LLMs could affect people's mindsets is interesting and important.
2. The proposed datasets and curation methods are sound and novel.
3. The experiments are comprehensive and can support most of the claims.

Weaknesses

1. Measuring the degree to which the LLM gaslights the user is the basis of the entire experiment. However, the designed metrics and scales lack an explanation.
2. How the human annotators were recruited and how they worked is not clear. Since all the results rely on the human annotations for justification, adding more clarification, or recruiting more annotators (e.g. from online platforms) and calculating metrics such as IAA, would strengthen this part.
3. How the attacks could affect the general abiliti

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The topic of psychological manipulation via LLMs is both novel and critical, as LLMs become more integrated into daily life. The use of various psychological metrics to assess gaslighting effects on users’ mental states is a valuable addition to the evaluation. The paper includes helpful visual aids, such as clustering distributions and radar charts, to clarify findings.

Weaknesses

The paper’s reliance on GPT-4 for scoring gaslighting may introduce biases inherent in GPT-4’s design. The framework and attacks, while effective, are largely adaptations of existing techniques, which might limit the novelty. While the study simulates user interaction, real user behaviors in response to gaslighting were not part of the experiment.

Code & Models

Repositories

maxwe11y/gaslightingllm
pytorch · Official

Datasets

Maxwe11y/gaslighting
dataset · 32 dl

Videos

Can a Large Language Model be a Gaslighter? · slideslive

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)