Can Interpretation Predict Behavior on Unseen Data?
Victoria R. Li, Jenny Kaufmann, Martin Wattenberg, David Alvarez-Melis, Naomi Saphra

TL;DR
This paper investigates whether interpretability tools, specifically attention patterns in Transformer models, can predict out-of-distribution (OOD) generalization, and reports correlations between hierarchical attention patterns and OOD performance.
Contribution
The paper shows that simple observational interpretability metrics can predict OOD behavior across a population of Transformers trained on a synthetic classification task, suggesting that interpretability can help anticipate model behavior on unseen data.
Findings
Hierarchical attention patterns correlate with hierarchical OOD generalization
Attention analysis can predict OOD performance even without explicit hierarchical rules
Transformers exhibit diverse systematic generalization rules in OOD settings
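The idea of distinct generalization rules can be made concrete with a small sketch. The summary does not spell out the rules themselves, so the following is a hypothetical illustration only: it assumes a parentheses task (the reviews below mention a parentheses-balancing setup) and two candidate rules, a hierarchical one that tracks nesting and a flat one that only counts symbols. The function names and example strings are invented for this sketch and are not taken from the paper.

```python
def hierarchical_rule(s: str) -> bool:
    """Nested balance: every ')' must close an earlier, still-open '('.

    Assumes the alphabet is only '(' and ')'; any other character
    would be treated like ')'.
    """
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1
        if depth < 0:  # a ')' appeared with nothing open
            return False
    return depth == 0


def counting_rule(s: str) -> bool:
    """Flat heuristic: only compares the number of '(' and ')'."""
    return s.count("(") == s.count(")")


# Hypothetical inputs on which both rules give the same label.
in_dist = ["()", "(())", "()()", "(()", "((", "(()("]
# Hypothetical inputs with equal counts but broken nesting.
ood = [")(", "())(", "(()))("]

agree_in = all(hierarchical_rule(s) == counting_rule(s) for s in in_dist)
disagree_ood = [s for s in ood if hierarchical_rule(s) != counting_rule(s)]

print("rules agree on in_dist:", agree_in)
print("rules disagree on:", disagree_ood)
```

Two models that fit the first set equally well could follow either rule, and only inputs like those in the second set would tell them apart. This is the sense in which OOD behavior can vary across a population of models that all look the same in-distribution.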
Abstract
Interpretability research often aims to predict how a model will respond to targeted interventions on specific mechanisms. However, it rarely predicts how a model will respond to unseen input data. This paper explores the promises and challenges of interpretability as a tool for predicting out-of-distribution (OOD) model behavior. Specifically, we investigate the correspondence between attention patterns and OOD generalization in hundreds of Transformer models independently trained on a synthetic classification task. These models exhibit several distinct systematic generalization rules OOD, forming a diverse population for correlational analysis. In this setting, we find that simple observational tools from interpretability can predict OOD performance. In particular, when in-distribution attention exhibits hierarchical patterns, the model is likely to generalize hierarchically on OOD…
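The correlational analysis described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes each trained model has been reduced to one scalar attention score and one OOD accuracy, and it uses a Pearson correlation, which the abstract does not specify. The `pearson` helper and all numbers are placeholders invented for the example, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Placeholder values, one entry per hypothetical model.
# These are NOT measurements from the paper.
toy_attention_score = [0.1, 0.4, 0.5, 0.8, 0.9]
toy_ood_accuracy = [0.2, 0.35, 0.6, 0.7, 0.95]

r = pearson(toy_attention_score, toy_ood_accuracy)
print(f"correlation across toy population: {r:.3f}")
```

A population of independently trained models matters here because a correlation of this kind needs variation in both quantities. A single model would give one point and no trend.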
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper has a great presentation. The figures are of high quality, the text is easy to read, and the experiments are well explained. The paper tackles an important challenge: whether we can use attention heads to predict how a Transformer will act "in the wild". Notably, the fact that attention ablation has no effect in-distribution but can cause unpredictable behavior out-of-distribution is an important result. It warns explainability researchers that only evaluating explainability methods
Incomplete Related Work
While Section 2 is a great read, I think it is too high-level for ICLR. I would rather motivate the current work with a Related Work section that discusses the papers from the introduction in more depth. It would be interesting to describe what activation steering, activation patching, and Sparse Autoencoders (SAEs) are, and their limitations when it comes to OOD data. For instance, the work of (Kisanne et al. 2024, Smith et al. 2025) (line 48 of the manuscript) foc
1. Using interpretability to analyze the model behavior could be an interesting topic.
2. The authors provide the dataset and code with detailed experimental settings for good reproducibility.
1. The title can be misleading. The authors mainly look at OOD generalization, and only test a miniGPT-type Transformer model on a specific synthetic dataset. To me, it is not appropriate to use the general phrase “model behavior”.
2. Also, given a Transformer-based architecture whose different hyperparameters can influence OOD generalization, I think the paper should not have stated “hundreds of models”.
3. All the results presented in the paper rely heavily on specific synthetic datase
I believe the experiments are interesting and may be sound (difficult to evaluate given the poor presentation). Section 4, with the experiments and results around vestigial circuits and factors in rule-selection, is interesting and readable. Section 5 (also well-named) also reads better than Sections 1-3, but it was a struggle to understand the problem framing and experimental setting, so it is difficult for me to comment on soundness. The observation that the common setting of ablating propose
MAJOR: The paper is needlessly hard to follow and feels vague in many places. It makes for a frustrated reader. Many of the questions I have about this paper are probably due to the poor exposition of the motivation, concepts, and experimental setting. The paper flip-flops awkwardly between tedious details and broad, vague conceptual or epistemic statements, making it difficult to follow, evaluate, or build on the work presented.
MAJOR: The paper is missing precise statements to guide the reade
1. Good and ambitious motivation.
2. Although it uses a toy setup, it includes many models and experiments, and some findings are interesting.
3. The paper is well written, with accurate, clear descriptions and excellent details.
1. The experimental method relies solely on a simplified parentheses-balancing task. This narrow setup may limit the generality of the conclusions.
2. While the findings (e.g., “independently trained models cluster around systematic generalization rules”) are interesting, the paper would benefit from demonstrating at least one concrete example or use case that shows how such findings could be useful in practical applications or improvements for model designs.
3. The dataset used in evaluation is
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare · Adversarial Robustness in Machine Learning
