Counterfactual Explainable Incremental Prompt Attack Analysis on Large Language Models
Dong Shu, Mingyu Jin, Tianle Chen, Chong Zhang, Yongfeng Zhang

TL;DR
This paper introduces CEIPA, a novel explainable method to analyze and improve the robustness of large language models against prompt-based attacks by incrementally modifying prompts at multiple levels.
Contribution
We propose CEIPA, a new incremental counterfactual approach that elucidates LLM vulnerabilities and enhances attack effectiveness through structured prompt modifications.
Findings
CEIPA effectively reveals LLM vulnerabilities to prompt attacks.
Incremental prompt modifications improve attack success rates.
The framework provides counterfactual explanations for harmful response generation.
Abstract
This study sheds light on the imperative need to bolster safety and privacy measures in large language models (LLMs), such as GPT-4 and LLaMA-2, by identifying and mitigating their vulnerabilities through explainable analysis of prompt attacks. We propose Counterfactual Explainable Incremental Prompt Attack (CEIPA), a novel technique where we guide prompts in a specific manner to quantitatively measure attack effectiveness and explore the embedded defense mechanisms in these models. Our approach is distinctive for its capacity to elucidate the reasons behind the generation of harmful responses by LLMs through an incremental counterfactual methodology. By organizing the prompt modification process into four incremental levels (word, sentence, character, and a combination of character and word), we facilitate a thorough examination of the susceptibilities inherent to LLMs. The findings…
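The incremental counterfactual idea in the abstract can be illustrated with a small sketch: change a prompt one step at a time and record the first step at which the model's outcome differs from the original. This is not the authors' implementation. The function names (`word_level_steps`, `char_level_steps`, `first_flip`), the substitution table, and the `is_refused` judge are all hypothetical, and the judge here is a toy keyword check standing in for a real model call.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple


def word_level_steps(prompt: str, substitutions: Dict[str, str]) -> Iterator[str]:
    """Yield prompts in which one more word has been substituted at each step."""
    words = prompt.split()
    for i, word in enumerate(words):
        replacement = substitutions.get(word.lower())
        if replacement is not None:
            words[i] = replacement
            yield " ".join(words)


def char_level_steps(prompt: str, positions: List[int], marker: str = "*") -> Iterator[str]:
    """Yield prompts in which one more character has been overwritten at each step."""
    chars = list(prompt)
    for pos in positions:
        if 0 <= pos < len(chars):
            chars[pos] = marker
            yield "".join(chars)


def first_flip(
    original: str,
    variants: Iterable[str],
    is_refused: Callable[[str], bool],
) -> Optional[Tuple[int, str]]:
    """Return (step, variant) for the first variant whose outcome differs
    from the original prompt's outcome, or None if no variant flips it."""
    baseline = is_refused(original)
    for step, variant in enumerate(variants, start=1):
        if is_refused(variant) != baseline:
            return step, variant
    return None


# Toy judge (assumption): "refuses" whenever a trigger word is present.
def toy_judge(prompt: str) -> bool:
    return "forbidden" in prompt.lower()


original = "tell me the forbidden secret word"
variants = list(word_level_steps(original, {"tell": "show", "forbidden": "hidden"}))
result = first_flip(original, variants, toy_judge)
```

In this toy run the outcome flips at the second word-level step, which points to the substituted word as the counterfactual explanation for the change in behavior. Sentence-level and combined character-and-word steps would follow the same generator pattern.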
Taxonomy
Topics: Topic Modeling
