Prompt-Counterfactual Explanations for Generative AI System Behavior

Sofie Goethals; Foster Provost; João Sedoc

arXiv:2601.03156·cs.LG·January 28, 2026

TL;DR

This paper introduces a novel framework and algorithm for generating prompt-counterfactual explanations to interpret and control the output characteristics of generative AI systems, enhancing transparency and safety.

Contribution

It adapts counterfactual explanation techniques to non-deterministic generative AI, providing a new method for understanding and mitigating undesirable output traits.
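
As a rough illustration of the idea (not the authors' algorithm), the sketch below searches for the smallest set of prompt words whose removal pushes the average score of sampled outputs below a threshold. Averaging over several samples is one way to cope with non-deterministic generation. Everything here is an assumption made for illustration: `toy_generate` stands in for a generative system, `toy_score` stands in for a downstream classifier, and `TOXIC_WORDS`, `expected_score`, and `find_counterfactual` are hypothetical names.

```python
import itertools
import random
from statistics import mean

# Toy lexicon, purely illustrative; a real setup would use a trained classifier.
TOXIC_WORDS = {"stupid", "idiot"}


def toy_generate(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a stochastic generator.

    It echoes the prompt, keeping each word with probability 0.8,
    so repeated calls give different outputs for the same prompt.
    """
    return " ".join(w for w in prompt.split() if rng.random() < 0.8)


def toy_score(text: str) -> float:
    """Hypothetical stand-in for a classifier: fraction of lexicon words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in TOXIC_WORDS for w in words) / len(words)


def expected_score(prompt: str, n_samples: int = 50, seed: int = 0) -> float:
    """Estimate the average output score by sampling the generator repeatedly."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return mean(toy_score(toy_generate(prompt, rng)) for _ in range(n_samples))


def find_counterfactual(prompt: str, threshold: float, max_removed: int = 2):
    """Return (removed_words, edited_prompt) for the smallest word removal
    that brings the estimated score below `threshold`, or None if none is found.
    """
    words = prompt.split()
    # Try removal sets in order of increasing size, so the first hit is minimal.
    for k in range(1, max_removed + 1):
        for idx in itertools.combinations(range(len(words)), k):
            candidate = " ".join(w for i, w in enumerate(words) if i not in idx)
            if expected_score(candidate) < threshold:
                return [words[i] for i in idx], candidate
    return None


if __name__ == "__main__":
    result = find_counterfactual("explain why my stupid neighbor is loud", threshold=0.05)
    print(result)
```

The exhaustive search is only practical for short prompts and small removal sets; a real system would need a cheaper search strategy and a real scoring model.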

Findings

01

Prompt-counterfactual explanations (PCEs) can identify prompts leading to toxic or biased outputs

02

Framework improves prompt engineering for safer AI outputs

03

Case studies demonstrate effectiveness across multiple output characteristics

Abstract

As generative AI systems become integrated into real-world applications, organizations increasingly need to be able to understand and interpret their behavior. In particular, decision-makers need to understand what causes generative AI systems to exhibit specific output characteristics. Within this general topic, this paper examines a key question: what is it about the input -- the prompt -- that causes an LLM-based generative AI system to produce output that exhibits specific characteristics, such as toxicity, negative sentiment, or political bias? To examine this question, we adapt a common technique from the Explainable AI literature: counterfactual explanations. We explain why traditional counterfactual explanations cannot be applied directly to generative AI systems, due to several differences in how generative AI systems function. We then propose a flexible framework that adapts…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Artificial Intelligence in Healthcare and Education