Prompt-Counterfactual Explanations for Generative AI System Behavior
Sofie Goethals, Foster Provost, João Sedoc

TL;DR
This paper introduces a novel framework and algorithm for generating prompt-counterfactual explanations to interpret and control the output characteristics of generative AI systems, enhancing transparency and safety.
Contribution
It adapts counterfactual explanation techniques to non-deterministic generative AI, providing a new method for understanding and mitigating undesirable output traits.
Findings
PCEs can identify prompts leading to toxic or biased outputs
Framework improves prompt engineering for safer AI outputs
Case studies demonstrate effectiveness across multiple output characteristics
Abstract
As generative AI systems become integrated into real-world applications, organizations increasingly need to be able to understand and interpret their behavior. In particular, decision-makers need to understand what causes generative AI systems to exhibit specific output characteristics. Within this general topic, this paper examines a key question: what is it about the input -- the prompt -- that causes an LLM-based generative AI system to produce output that exhibits specific characteristics, such as toxicity, negative sentiment, or political bias. To examine this question, we adapt a common technique from the Explainable AI literature: counterfactual explanations. We explain why traditional counterfactual explanations cannot be applied directly to generative AI systems, due to several differences in how generative AI systems function. We then propose a flexible framework that adapts…
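To make the central question concrete, here is a minimal sketch of what a prompt counterfactual could look like for a non-deterministic system. It is not the paper's algorithm (the abstract above is truncated before the framework is described). It is a brute-force illustration under stated assumptions: the generator is sampled several times per prompt, a classifier scores each output, and we look for the smallest set of words whose removal pushes the average score below a threshold. All names (`find_prompt_counterfactual`, `toy_generate`, `toy_score`, `threshold`, `n_samples`, `max_removed`) are hypothetical.

```python
import itertools
import random
from typing import Callable, List, Optional, Tuple


def mean_score(prompt: str,
               generate: Callable[[str], str],
               score: Callable[[str], float],
               n_samples: int) -> float:
    """Average the output score over repeated samples, since one
    sample from a non-deterministic generator is not representative."""
    return sum(score(generate(prompt)) for _ in range(n_samples)) / n_samples


def find_prompt_counterfactual(prompt: str,
                               generate: Callable[[str], str],
                               score: Callable[[str], float],
                               threshold: float,
                               n_samples: int = 20,
                               max_removed: int = 2
                               ) -> Optional[Tuple[List[str], str]]:
    """Assumed search strategy: try removing 1 word, then 2, and so on.
    Return (removed words, edited prompt) for the first edit whose
    average score falls below `threshold`, or None if none is found."""
    words = prompt.split()
    for k in range(1, max_removed + 1):
        for idxs in itertools.combinations(range(len(words)), k):
            candidate = " ".join(w for i, w in enumerate(words) if i not in idxs)
            if mean_score(candidate, generate, score, n_samples) < threshold:
                return [words[i] for i in idxs], candidate
    return None


# Toy stand-ins (assumptions, not real models): the "generator" answers
# rudely about 80% of the time when the prompt contains the word "stupid".
_rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    if "stupid" in prompt.split() and _rng.random() < 0.8:
        return "rude reply"
    return "polite reply"


def toy_score(output: str) -> float:
    # Stand-in for a toxicity classifier: 1.0 for rude outputs, else 0.0.
    return 1.0 if "rude" in output else 0.0


if __name__ == "__main__":
    print(find_prompt_counterfactual(
        "Explain why this stupid plan failed",
        toy_generate, toy_score, threshold=0.3))
```

In this toy setup the search should single out the one word that drives the rude outputs. A real system would replace the stubs with an LLM call and a trained classifier, and exhaustive deletion would become too expensive for long prompts, so a smarter search would be needed.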
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Artificial Intelligence in Healthcare and Education
