Can a large language model be a gaslighter?
Wei Li, Luyao Zhu, Yang Song, Ruixi Lin, Rui Mao, and Yang You

TL;DR
This paper investigates the vulnerability of large language models to gaslighting attacks and proposes a framework to elicit and analyze such manipulative behaviors, demonstrating that safety measures can reduce this risk with minimal utility loss.
Contribution
It introduces DeepCoG, a novel two-stage framework for eliciting and analyzing gaslighting in LLMs, and proposes safety alignment strategies to mitigate this issue.
Findings
Prompt-based and fine-tuning attacks can turn LLMs into gaslighters.
Safety alignment strategies improve LLM safety by 12.05%.
Gaslighting risks persist even if models pass harmfulness tests.
Abstract
Large language models (LLMs) have gained human trust due to their capabilities and helpfulness. However, this in turn may allow LLMs to affect users' mindsets by manipulating language. This is termed gaslighting, a psychological effect. In this work, we aim to investigate the vulnerability of LLMs under prompt-based and fine-tuning-based gaslighting attacks. To this end, we propose a two-stage framework, DeepCoG, designed to: 1) elicit gaslighting plans from LLMs with the proposed DeepGaslighting prompting template, and 2) acquire gaslighting conversations from LLMs through our Chain-of-Gaslighting method. The gaslighting conversation dataset, along with a corresponding safe dataset, is applied to fine-tuning-based attacks on open-source LLMs and to anti-gaslighting safety alignment on these LLMs. Experiments demonstrate that both prompt-based and fine-tuning-based attacks transform three…
Peer Reviews
Decision: ICLR 2025 Poster
* This paper investigates a novel type of vulnerability in LLMs: gaslighting. The study provides valuable insights into the sources, harmfulness, and potential defenses against this issue.
* The collected datasets are a useful resource for the community, aiding further study of gaslighting problems and contributing to advancements in model safety.
* The prompt-based attack appears to be ineffective on models with general safety alignment, such as ChatGPT and LLaMA2-Chat. This raises concerns about the significance of the gaslighting problem. If previous general safety alignment techniques and safeguards already mitigate this specific attack, then focusing on gaslighting as a unique threat may be unnecessary.
* The finetuning-based attack seems impractical in real-world scenarios. It is unlikely that a model developer would use primarily h
1. The research question of how LLMs could affect people's mindsets is interesting and important.
2. The proposed datasets and curation methods are sound and novel.
3. The experiments are comprehensive and can support most of the claims.
1. Measuring the degree to which the LLM gaslights the user is the basis of the entire experiment. However, the designed metrics and scales lack an explanation.
2. How the human annotators were recruited and how they worked is not clear. Since all the results rely on the human annotations for justification, adding more clarification, or recruiting more annotators (e.g. from online platforms) and calculating metrics such as IAA, will strengthen this part.
3. How the attacks could affect the general abiliti
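The inter-annotator agreement (IAA) that this review asks for can be illustrated with Cohen's kappa, one common IAA statistic for two annotators. The sketch below is only an illustration: the function name `cohen_kappa` and the example ratings are hypothetical, and nothing here is taken from the paper's actual annotation protocol or scales.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")

    n = len(labels_a)

    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two annotators' marginal frequencies,
    # summed over labels (Counter returns 0 for labels one annotator never used).
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary ratings (1 = response judged as gaslighting, 0 = not).
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for chance, so it is usually more informative than percent agreement when one label dominates. For ordinal scales or more than two annotators, a weighted kappa or Krippendorff's alpha may be a better fit.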
The topic of psychological manipulation via LLMs is both novel and critical, as LLMs become more integrated into daily life. The use of various psychological metrics to assess gaslighting effects on users’ mental states is a valuable addition to the evaluation. The paper includes helpful visual aids, such as clustering distributions and radar charts, to clarify findings.
The paper’s reliance on GPT-4 for scoring gaslighting may introduce biases inherent in GPT-4’s design. The framework and attacks, while effective, are largely adaptations of existing techniques, which might limit the novelty. While the study simulates user interaction, real user behaviors in response to gaslighting were not part of the experiment.
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
