Benchmarking Gaslighting Negation Attacks Against Multimodal Large Language Models

Bin Zhu; Yinxuan Gui; Huiyan Qi; Jingjing Chen; Chong-Wah Ngo; Ee-Peng Lim

arXiv:2501.19017 · cs.CL · October 9, 2025

Open Access · 3 Reviews

TL;DR

This paper evaluates the vulnerability of multimodal large language models to gaslighting negation attacks, revealing significant robustness gaps and introducing a new benchmark for assessing model resilience to such adversarial inputs.

Contribution

It introduces GaslightingBench, the first benchmark specifically designed to evaluate MLLMs' susceptibility to negation-based adversarial attacks, and provides comprehensive analysis across multiple models and domains.

Findings

01

Proprietary models like GPT-4o show better resilience than open-source models.

02

All models tested are vulnerable to negation attacks, especially in subjective domains.

03

Objective domains exhibit smaller performance drops under negation attacks.

Abstract

Multimodal Large Language Models (MLLMs) have exhibited remarkable advancements in integrating different modalities, excelling in complex understanding and generation tasks. Despite their success, MLLMs remain vulnerable to conversational adversarial inputs. In this paper, we systematically study gaslighting negation attacks: a phenomenon where models, despite initially providing correct answers, are persuaded by user-provided negations to reverse their outputs, often fabricating justifications. We conduct extensive evaluations of state-of-the-art MLLMs across diverse benchmarks and observe substantial performance drops when negation is introduced. Notably, we introduce the first benchmark GaslightingBench, specifically designed to evaluate the vulnerability of MLLMs to negation arguments. GaslightingBench consists of multiple-choice questions curated from existing datasets, along with…
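The before/after comparison the abstract describes can be sketched as a small scoring routine. This is a minimal illustration, not the paper's evaluation code: the `Record` type, the `summarize_negation_attack` function, and the "flip rate" metric are hypothetical names introduced here. The sketch assumes each multiple-choice item stores the ground-truth option, the model's first answer, and its answer after the user negates that first answer.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Record:
    """One multiple-choice item (hypothetical schema, not from the paper)."""
    gold: str    # ground-truth option, e.g. "B"
    before: str  # model's answer before the negation turn
    after: str   # model's answer after the user negates it


def summarize_negation_attack(records: List[Record]) -> Dict[str, float]:
    """Compare accuracy before and after a negation turn."""
    if not records:
        raise ValueError("records must not be empty")

    n = len(records)
    acc_before = sum(r.before == r.gold for r in records) / n
    acc_after = sum(r.after == r.gold for r in records) / n

    # Flip rate: among items answered correctly at first, the fraction
    # whose answer is no longer correct after the negation turn.
    initially_correct = [r for r in records if r.before == r.gold]
    flipped = sum(r.after != r.gold for r in initially_correct)
    flip_rate = flipped / len(initially_correct) if initially_correct else 0.0

    return {
        "acc_before": acc_before,
        "acc_after": acc_after,
        "drop": acc_before - acc_after,
        "flip_rate": flip_rate,
    }


if __name__ == "__main__":
    demo = [
        Record(gold="A", before="A", after="B"),  # correct, then reversed
        Record(gold="B", before="B", after="B"),  # correct, held firm
        Record(gold="C", before="D", after="D"),  # wrong throughout
        Record(gold="D", before="D", after="A"),  # correct, then reversed
    ]
    print(summarize_negation_attack(demo))
```

Reporting the flip rate alongside the raw accuracy drop separates items the model never got right from items it abandoned under pressure, which is the behavior the abstract calls gaslighting.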

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper addresses an under-explored but critical issue (negation-induced inconsistency in MLLMs) and introduces the first dedicated benchmark (GaslightingBench) for evaluating this vulnerability.
2. The study evaluates a wide range of MLLMs across multiple datasets and question formats, providing a thorough and comparative analysis of model robustness.
3. Rigorous Methodology: The evaluation pipeline is well-structured, including negation generation, post-processing, and careful dataset curati

Weaknesses

1. Benchmark Bias Toward MCQs: GaslightingBench is primarily based on multiple-choice questions, which may not fully capture the complexity of real-world adversarial interactions or free-form reasoning.
2. Real-world complexity is not considered: The study focuses on controlled benchmarks; it does not test how gaslighting attacks perform in more dynamic, multi-turn, or real-world conversational settings.
3. Lack of Mitigation Strategies or Insight: The paper identifies the problem but

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The paper is clearly written and easy to follow.
2. Gaslighting attacks on multimodal LLMs are under-explored (although they have been extensively studied for text-only LLMs).

Weaknesses

### **Major**

1. **Over-simplified gaslighting prompt type:** The paper only studies direct-negation, short-answer gaslighting prompts. However, I think this type of gaslighting prompt may be over-simplified and less practical:
   - In this work, the gaslighting prompts all directly tell the LLM the (incorrect) answer. However, since LLMs are trained to follow user instructions, if the user directly tells the LLM what the answer should be, then it is expected that the LLM should c

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper introduces “gaslighting negation” as a new class of conversational attack, distinct from jailbreak or prompt injection. It’s a subtle yet impactful vulnerability, especially in real-world dialogue contexts.
2. Proprietary models (Gemini-1.5-flash, GPT-4o, Claude-3.5) outperform open-source ones (Qwen, LLaVA) but still degrade notably.
3. Figure 7 (p.8) illustrates models contradicting earlier correct answers, sometimes even producing hallucinated justifications (“I apologize, the co

Weaknesses

1. The explanation of why over-alignment induces gaslighting behavior is qualitative.
2. The paper exposes the vulnerability well but provides no mitigation strategies, even conceptually (e.g., calibration, adversarial training, debate-style reinforcement).
3. The study does not explore internal attention or activation traces to explain why negation overrides factual grounding, which is especially relevant for multimodal reasoning.
4. Minor stylistic issues (e.g., “negation,” “conversational negation atta

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems