Counterfactual Explainable Incremental Prompt Attack Analysis on Large Language Models

Dong Shu; Mingyu Jin; Tianle Chen; Chong Zhang; Yongfeng Zhang

arXiv:2407.09292 · cs.CR · July 18, 2024

TL;DR

This paper introduces CEIPA, a novel explainable method to analyze and improve the robustness of large language models against prompt-based attacks by incrementally modifying prompts at multiple levels.

Contribution

We propose CEIPA, a new incremental counterfactual approach that elucidates LLM vulnerabilities and enhances attack effectiveness through structured prompt modifications.

Findings

01. CEIPA effectively reveals LLM vulnerabilities to prompt attacks.

02. Incremental prompt modifications improve attack success rates.

03. The framework provides counterfactual explanations for harmful response generation.

Abstract

This study sheds light on the imperative need to bolster safety and privacy measures in large language models (LLMs), such as GPT-4 and LLaMA-2, by identifying and mitigating their vulnerabilities through explainable analysis of prompt attacks. We propose Counterfactual Explainable Incremental Prompt Attack (CEIPA), a novel technique where we guide prompts in a specific manner to quantitatively measure attack effectiveness and explore the embedded defense mechanisms in these models. Our approach is distinctive for its capacity to elucidate the reasons behind the generation of harmful responses by LLMs through an incremental counterfactual methodology. By organizing the prompt modification process into four incremental levels (word, sentence, character, and a combination of character and word), we facilitate a thorough examination of the susceptibilities inherent to LLMs. The findings…
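The incremental counterfactual idea in the abstract can be illustrated with a small sketch. The abstract does not give the paper's actual procedure, so everything here is an assumption made for illustration: the helper names `incremental_word_edits` and `first_flip` are hypothetical, only the word level is shown, and `toy_judge` is a trivial stand-in for the real step of querying a model and labeling its response. The sketch applies one word substitution at a time and reports the first step at which the judged outcome differs from the baseline, which is the kind of minimal change a counterfactual explanation points to.

```python
from typing import Callable, List, Optional, Tuple


def incremental_word_edits(prompt: str,
                           substitutions: List[Tuple[str, str]]) -> List[str]:
    """Apply word substitutions cumulatively, one per step.

    Returns the original prompt followed by one variant per substitution.
    (Hypothetical helper; not taken from the paper.)
    """
    variants = [prompt]
    current = prompt
    for old, new in substitutions:
        # Replace only the first occurrence so each step is a single edit.
        current = current.replace(old, new, 1)
        variants.append(current)
    return variants


def first_flip(variants: List[str],
               judge: Callable[[str], bool]) -> Optional[int]:
    """Return the first step whose judged outcome differs from step 0.

    Returns None if no variant changes the outcome.
    """
    baseline = judge(variants[0])
    for step, text in enumerate(variants[1:], start=1):
        if judge(text) != baseline:
            return step
    return None


def toy_judge(text: str) -> bool:
    # Assumption: stands in for "send the prompt to a model and label
    # the response". Here it only checks for one keyword.
    return "urgent" in text


variants = incremental_word_edits(
    "please summarize the quarterly report",
    [("please", "kindly"), ("summarize", "condense"), ("quarterly", "urgent")],
)
flip_step = first_flip(variants, toy_judge)
print(flip_step)  # 3: the third substitution is the one that changes the outcome
```

Because the edits are cumulative and applied one at a time, the step index isolates which single change flipped the outcome. The same loop would apply at the sentence or character level by swapping in a different edit function.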

Taxonomy

Topics: Topic Modeling