CELL your Model: Contrastive Explanations for Large Language Models
Ronny Luss, Erik Miehling, Amit Dhurandhar

TL;DR
This paper introduces a contrastive explanation method for large language models that explains their outputs by showing how slight prompt modifications lead to different responses, using a query-efficient algorithm.
Contribution
It proposes a novel contrastive explanation approach for LLMs that requires only black-box access and a scoring function, along with an efficient algorithm for generating contrasts within query limits.
Findings
Effective in explaining LLM responses in open-text generation
Applicable to chatbot conversation analysis
Uses a query-efficient contrast creation algorithm
Abstract
The advent of black-box deep neural network classification models has sparked the need to explain their decisions. However, in the case of generative AI, such as large language models (LLMs), there is no class prediction to explain. Rather, one can ask why an LLM output a particular response to a given prompt. In this paper, we answer this question by proposing a contrastive explanation method requiring simply black-box/query access. Our explanations suggest that an LLM outputs a reply to a given prompt because if the prompt was slightly modified, the LLM would have given a different response that is either less preferable or contradicts the original response. The key insight is that contrastive explanations simply require a scoring function that has meaning to the user and not necessarily a specific real valued quantity (viz. class label). To this end, we offer a novel budgeted…
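To make the idea in the abstract concrete, here is a minimal sketch of a budgeted contrastive search that needs only query access and a user-supplied scoring function. It is not the paper's CELL or CELL-budget algorithm; it is a simple greedy single-word substitution loop. Every name in it (`contrastive_search`, `toy_model`, `toy_score`, `toy_substitutes`) and the threshold and budget values are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple


def contrastive_search(
    prompt: str,
    model: Callable[[str], str],
    score: Callable[[str, str, str, str], float],
    substitutes: Callable[[str], List[str]],
    threshold: float,
    budget: int,
) -> Tuple[Optional[str], Optional[str], int]:
    """Greedy single-word substitution search under a query budget.

    Returns (contrastive_prompt, contrastive_response, queries_used),
    or (None, None, queries_used) if no contrast is found in budget.
    """
    original_response = model(prompt)  # first black-box query
    queries = 1
    words = prompt.split()
    for i, word in enumerate(words):
        for alt in substitutes(word):
            if queries >= budget:
                return None, None, queries
            candidate = " ".join(words[:i] + [alt] + words[i + 1:])
            response = model(candidate)  # one more black-box query
            queries += 1
            # The score only needs to be meaningful to the user,
            # e.g. a preference drop or a contradiction score.
            if score(prompt, original_response, candidate, response) >= threshold:
                return candidate, response, queries
    return None, None, queries


# Hypothetical stand-ins for an LLM, a scoring function, and a word substituter.
def toy_model(prompt: str) -> str:
    if "car" in prompt.split():
        return "Check the engine belt."
    return "Please describe the vehicle."


def toy_score(p0: str, r0: str, p1: str, r1: str) -> float:
    return 1.0 if r0 != r1 else 0.0  # 1.0 when the responses differ


def toy_substitutes(word: str) -> List[str]:
    return {"car": ["bike"], "weird": ["odd"]}.get(word, [])


contrast, reply, used = contrastive_search(
    "My car is making a weird noise",
    toy_model, toy_score, toy_substitutes,
    threshold=0.5, budget=10,
)
print(contrast, "|", reply, "|", used)
```

In this toy run, replacing "car" with "bike" changes the response, so the modified prompt serves as the contrast. A real setup would swap in an actual LLM call and a scoring function such as a preference or contradiction model.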
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The paper provides a clear formulation for finding the contrastive explanation. Such a clear formulation allows many search algorithms to be applied rather than simply relying on heuristics. I think the formulation also has the potential to be leveraged in other setups. 2. The method part of the paper is well-written. Also, it's commendable that the paper takes the search budget into account by proposing CELL-budget. From my perspective, designing algorithms to jointly optimize performance an
1. One major weakness of this paper is that the setup and takeaways for the Experiments section are not very clear. The major question I have is how to define the success of finding contrastive explanations. In Section 5.1, the results use the scoring function preferences, which characterize the pairwise preference between the given explanation and the baseline explanation generated by LLM through direct prompting. However, given the pairwise comparison model (i.e., stanfordnlp/SteamSHP-flan-t5-x
The paper is easy to follow. The proposed methods are simple and straightforward for generating contrastive explanations for LLMs.
- The novelty and contribution are not prominent. The authors claim that “this paper offers the first contrastive explanation methods for LLMs.” However, [1] has already proposed counterfactual explanations for LLMs, which are conceptually similar to contrastive explanations. This prior work is notably absent from the related literature review. While the authors emphasize that many previous studies focus on using LLMs to generate contrastive explanations for classification tasks, I do not see si
1. It provides a practical framework for explaining LLM outputs through contrastive examples, offering a way to understand why models generate specific responses by showing how small input changes lead to different outputs. 2. The proposed CELL-budget algorithm demonstrates improved efficiency by intelligently managing model queries, making it valuable for working with longer texts while maintaining performance quality. 3. The framework is highly flexible, supporting multiple scoring functions
1. The framework for generating contrastive explanations lacks clear definitional boundaries. For instance, in Figure 2's example ("My car is making a weird noise when I accelerate. Can you help me diagnose the problem?"), there's insufficient clarity regarding the relative importance hierarchy among phrases like "car," "weird noise," "accelerate," etc. 2. A potential typographical error appears in Line 190, where "y1 contradicts y1" likely should read "y1 contradicts y2." 3. The examples presen
* The first work studying contrastive explanation for the generation case. * The algorithm sounds reasonable and the scoring function is clearly defined. * The authors study the computational budget, which is important in real use cases.
* My main concern is with the contrastive prompts found by CELL. While the authors report a 0.99 similarity score between the original and contrastive prompts, many examples in the paper seem semantically different. Using BERT embeddings to assess similarity may be problematic, as they primarily capture word overlap, and changing a word can lead to entirely different meanings. This raises questions about the use cases the author provided: 1. **Use case of automated red teaming:** A successful
1. The authors propose an innovative way to tackle explainability in black-box generative AI models, in particular LLMs. 2. Furthermore, they propose two algorithms, the second of which tackles optimal prompt perturbation under a budget, that present a novel approach to contrastive explanations. 3. The authors showcase their method in two use cases that have important practical ramifications. 4. The paper is well written and well structured. The figures are clear and the methods are nicely exemplified.
Main weaknesses: 1. The authors propose contrastive explanations for LLMs based on prompt perturbations. Although the authors nicely demonstrate use cases of their method, there are important factors that have not been systematically explored to evaluate the robustness and generalizability of this approach. This point sets aside the fact that the scoring function is use-case-dependent and thus its formalization changes for every case. In particular, I am curious whether the authors find differ
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI)
