Gumbel Counterfactual Generation From Language Models
Shauli Ravfogel, Anej Svete, Vésteinn Snæbjarnarson, Ryan Cotterell

TL;DR
This paper introduces Gumbel counterfactual generation, a method to produce true string counterfactuals from language models by reformulating them as structural equation models using the Gumbel-max trick, enabling precise causal analysis.
Contribution
It presents a novel framework for generating true string counterfactuals from language models through a Gumbel-max reformulation, distinguishing counterfactuals from interventions.
Findings
Produces meaningful counterfactuals
Reveals side effects of common intervention techniques
Enables causal analysis of language models
Abstract
Understanding and manipulating the causal generation mechanisms in language models is essential for controlling their behavior. Previous work has primarily relied on techniques such as representation surgery -- e.g., model ablations or manipulation of linear subspaces tied to specific concepts -- to \emph{intervene} on these models. To understand the impact of interventions precisely, it is useful to examine \emph{counterfactuals} -- e.g., how a given sentence would have appeared had it been generated by the model following a specific intervention. We highlight that counterfactual reasoning is conceptually distinct from interventions, as articulated in Pearl's causal hierarchy. Based on this observation, we propose a framework for generating true string counterfactuals by reformulating language models as a structural equation model using the Gumbel-max trick, which we called Gumbel…
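The Gumbel-max trick mentioned in the abstract can be illustrated with a minimal sketch: adding independent standard Gumbel noise to the logits and taking the argmax yields a sample from the softmax distribution. The function names (`softmax`, `sample_gumbel`, `gumbel_max_sample`, `empirical_frequencies`) and the toy logits are illustrative assumptions, not code from the paper.

```python
import math
import random
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sample_gumbel(rng):
    """Draw from a standard Gumbel(0, 1) via the inverse CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_sample(logits, rng):
    """Return (token, noise): the argmax of logits + noise, and the noise used."""
    noise = [sample_gumbel(rng) for _ in logits]
    perturbed = [l + u for l, u in zip(logits, noise)]
    token = max(range(len(logits)), key=perturbed.__getitem__)
    return token, noise


def empirical_frequencies(logits, n, seed=0):
    """Estimate the token distribution induced by Gumbel-max sampling."""
    rng = random.Random(seed)
    counts = Counter(gumbel_max_sample(logits, rng)[0] for _ in range(n))
    return [counts[i] / n for i in range(len(logits))]


if __name__ == "__main__":
    toy_logits = [1.0, 0.0, -1.0]  # hypothetical next-token logits
    print("softmax  :", softmax(toy_logits))
    print("empirical:", empirical_frequencies(toy_logits, 20000))
```

In this view the logits are the deterministic part of generation and the noise vector is the only source of randomness, which is what allows the sampling step to be written as a structural equation.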
Peer Reviews
Decision·ICLR 2025 Poster
The paper introduces a novel framework that reframes language models through the lens of causal inference, which offers a new perspective for analyzing the causal mechanisms within language models. Building on this new framework, the paper also proposes a novel approach to generate counterfactual strings for observed strings. Interestingly, the proposed method provides a systematic means to evaluate existing intervention techniques, revealing previously unrecognized side effects.
1. The presentation of the paper could benefit from clearer emphasis on its primary contribution: generating counterfactual pairs. It would be helpful if the authors clarified early on that the intervention models employed are drawn from existing methods. Before reaching Section 4, there may be some ambiguity regarding how to obtain the counterfactual encoder $\tilde{h}$, which seems the hardest part of counterfactual generation.
2. The way to find $U_t$ should be refined:
   - Given the co
The paper is beautiful and very well written. The layout and math are clean, and the English is sharp. The concepts are interesting, and the problem of generating counterfactuals is well motivated. Many (but not all) unnecessary details are swept away. It appears that the authors implemented their algorithms and attempted several different evaluations with real-world data and models used in practice. I think the presentation is so good as to be deceptive. The vision is grand and compelli
Once we adequately pay our respects to the presentation and get to the substance of the paper, things start to fall apart quickly, on many fronts. At a high level, the story is strong, but the math and evaluation simply do not back it up. Below the surface, the conceptual, theoretical and empirical aspects of this work are all severely lacking.
**Concepts.**
- At a high level, I feel the authors do not really "get" what GSEMs are and how they fit into the causal literature. The fact that
The paper makes key contributions that help advance the field of causal interpretability in language models:
$\textbf{Novel Framework:}$
- Reformulates autoregressive LMs as Generalized Structural Equation Models (GSEMs)
- Decomposes language generation into deterministic computation (logits from the model) and stochastic elements (sampling noise as exogenous variables)
- Leverages the Gumbel-max trick to establish equivalence with softmax sampling
$\textbf{Theoretical Foundations:}$
- Proposition 2.1
$\textbf{Empirical Validation of Causal Framework:}$ While Proposition 2.1 provides a theoretical foundation for the LM-GSEM equivalence, several key empirical validations are missing, even with Section B of the appendix:
$\textbf{Noise Completeness:}$ The paper proves that $W_t = \operatorname{argmax}_{w\in\Sigma} \left[(Eh(w_{<t}) + b)_w + U_t(w)\right]$ captures sampling behavior, but does not empirically validate that this captures all stochastic elements. Given that: $P(W = w_1...w_T) = P_E(W_1 = w_1, ..., W_T = w_T
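The structural equation quoted in the reviews suggests how a counterfactual token could be produced: infer noise $U_t$ consistent with the observed token, then reuse that noise with the logits of an intervened model. The sketch below uses the standard top-down construction for sampling Gumbel noise conditioned on the argmax; this is an assumption about one reasonable way to do it, not necessarily the authors' exact procedure, and all function names and logits are hypothetical.

```python
import math
import random


def sample_gumbel(loc, rng):
    """Draw from Gumbel(loc, 1) via the inverse CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return loc - math.log(-math.log(u))


def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def infer_noise(logits, observed, rng):
    """Sample noise U such that argmax(logits + U) == observed.

    The winning perturbed logit is distributed as Gumbel(logsumexp(logits));
    every other perturbed logit is a Gumbel(logit_i) truncated below it.
    """
    top = sample_gumbel(logsumexp(logits), rng)
    noise = []
    for i, logit in enumerate(logits):
        if i == observed:
            perturbed = top
        else:
            g = sample_gumbel(logit, rng)
            # Stable form of -log(exp(-top) + exp(-g)): truncates g at top.
            perturbed = min(top, g) - math.log1p(math.exp(-abs(top - g)))
        noise.append(perturbed - logit)
    return noise


def counterfactual_token(new_logits, noise):
    """Replay the same noise under (possibly intervened) logits."""
    scores = [l + u for l, u in zip(new_logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    rng = random.Random(0)
    original = [2.0, 0.5, -1.0, 0.0]    # hypothetical original logits
    intervened = [0.5, 2.0, -1.0, 0.0]  # hypothetical post-intervention logits
    u = infer_noise(original, observed=0, rng=rng)
    print("factual token       :", counterfactual_token(original, u))
    print("counterfactual token:", counterfactual_token(intervened, u))
```

Replaying the inferred noise under the original logits reproduces the observed token, so any change in the output is attributable to the change in logits rather than to fresh sampling noise.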
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Counterfactuals Explanations
