Interpretability Illusions in the Generalization of Simplified Models
Dan Friedman, Andrew Lampinen, Lucas Dixon, Danqi Chen, Asma Ghandeharioun

TL;DR
Simplified model representations, such as those obtained with SVD, may accurately reflect a model's behavior on training data but can be misleading about how it generalizes, especially out of distribution.
Contribution
This paper demonstrates that common simplification techniques can produce proxies that misrepresent a model's systematic generalization, highlighting interpretability illusions.
Findings
Simplified proxies often fail to capture out-of-distribution behavior.
In some cases, proxies outperform the original model in generalization.
Generalization gaps exist between simplified proxies and full models.
Abstract
A common method to study deep learning systems is to use simplified model representations--for example, using singular value decomposition to visualize the model's hidden states in a lower dimensional space. This approach assumes that the results of these simplifications are faithful to the original model. Here, we illustrate an important caveat to this assumption: even if the simplified representations can accurately approximate the full model on the training set, they may fail to accurately capture the model's behavior out of distribution. We illustrate this by training Transformer models on controlled datasets with systematic generalization splits, including the Dyck balanced-parenthesis languages and a code completion task. We simplify these models using tools like dimensionality reduction and clustering, and then explicitly test how these simplified proxies match the behavior of…
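To make the caveat in the abstract concrete, here is a minimal Python sketch. It is not the paper's experimental setup: it assumes a toy linear readout over hand-built hidden states, and every name (`fit_svd_basis`, `simplify`, `agreement`) and every number is hypothetical. The SVD basis is fit on training hidden states only (uncentered, which is also an assumption), and agreement between the full and simplified predictions is then measured in distribution and out of distribution.

```python
import numpy as np


def fit_svd_basis(hidden_train, rank):
    """Top-`rank` right singular vectors of the training hidden states."""
    _, _, vt = np.linalg.svd(hidden_train, full_matrices=False)
    return vt[:rank]  # shape (rank, d)


def simplify(hidden, basis):
    """Project hidden states onto the subspace spanned by `basis`."""
    return hidden @ basis.T @ basis


def agreement(hidden, readout, basis):
    """Fraction of rows where full and simplified predictions match."""
    full_pred = (hidden @ readout).argmax(axis=1)
    simple_pred = (simplify(hidden, basis) @ readout).argmax(axis=1)
    return float((full_pred == simple_pred).mean())


# Hypothetical readout: logit 0 = h0, logit 1 = h1 + 2 * h2.
readout = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.0, 2.0]])

# Training hidden states never use the third dimension.
hidden_train = np.array([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [2.0, 1.0, 0.0],
                         [1.0, 2.0, 0.0]])

# Out-of-distribution hidden states vary along the third dimension.
hidden_ood = np.array([[1.0, 0.0, 1.0],
                       [2.0, 1.0, 1.0],
                       [0.0, 1.0, 1.0],
                       [1.0, 2.0, 1.0]])

basis = fit_svd_basis(hidden_train, rank=2)
train_agreement = agreement(hidden_train, readout, basis)
ood_agreement = agreement(hidden_ood, readout, basis)
print("train agreement:", train_agreement)
print("OOD agreement:", ood_agreement)
```

On this hand-built data the rank-2 proxy matches the full model on every training row but on only half of the OOD rows, because the OOD inputs vary along a direction that the training data never exercised and the projection therefore discards.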
Peer Reviews
Decision: ICML 2024 Poster
The paper focuses on a very important question, that is, whether the explanation model/proxy is faithful to the original model. Or more precisely, whether the explanation model can mimic the original model’s behavior on different data distributions. Some existing explanation methods, such as distilling the target model into a decision tree, cannot guarantee the faithfulness on out-of-distribution data (e.g., masked input samples). Therefore, it is of significant value to delve into this issue.
1. I’m not familiar with the Dyck balanced-parenthesis language used in this paper, so I feel a bit confused and overwhelmed reading Section 2.1. It would be a great help if the authors could give some toy examples when introducing the Dyck languages.
2. The phrase “simplified models” can be misleading in this paper’s context. I was thinking of methods such as knowledge distillation or network pruning when I first saw the phrase “simplified models”. However, what the paper mainly focuses on are di
The paper is focused on a classic formal language task (Dyck grammar) and provides a convincing case study of an interpretability illusion in a Transformer language model. I think the main observation from the paper is interesting and quite relevant. The distributions are novel, as prior work in mechanistic interpretability mostly gives positive results. I think the main result is surprising, where the simplified model generalizes less well to OOD data than the full model, where intuitions from l
While the paper delivers a strong conceptual message, at a technical level, it is a single case study on a somewhat toy algorithmic task. That is, the scope of the work is a bit limited. I personally would be interested in a broader study on similar formal language tasks (for example, on other languages expressed by finite-state automata https://arxiv.org/abs/2210.10749). The paper would also be stronger if it looked into why the simplified model generalizes less well to the depth split. Figure
The paper provides several methods to analyse transformers trained on the Dyck language, investigating whether simplified versions of the model are faithful to the original one on out-of-distribution test sets.
Being unfamiliar with the literature, it is hard for me to understand the point of the analysis, and it is hard to tell whether that is due to a poor presentation or due to my lack of understanding. However, what I find a weakness of the paper is the fact that the analysis is not paired with proposed improvements or solutions. For example, what do the results from the paper entail? Is it that transformer models are not suitable for learning language models? Or is it that using model simplificati
Taxonomy
Topics: Machine Learning in Materials Science · Software Engineering Research · Explainable Artificial Intelligence (XAI)
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Dropout · Absolute Position Encodings · Softmax · Layer Normalization · Adam
