Interpretability in Activation Space Analysis of Transformers: A Focused Survey
Soniya Vijayakumar

TL;DR
This survey reviews interpretability methods focusing on the activation space of feed-forward layers in transformers, highlighting its under-explored role and suggesting future research directions.
Contribution
It provides a comprehensive review of interpretability techniques for activation space analysis in transformer feed-forward layers, an area with limited prior research.
Findings
Activation space analysis is under-explored in transformer interpretability.
The survey identifies key methods and gaps in current research.
Future directions for activation space interpretability are proposed.
Abstract
Natural language processing has seen major breakthroughs with the advent of transformers, which have remained state-of-the-art ever since. Much research has gone into analyzing, interpreting, and evaluating the attention layers and the underlying embedding space. Besides the self-attention layers, the feed-forward layers are a prominent architectural component of the transformer, yet our review of the literature finds that their role is under-explored. We focus on the latent space formed by the neuron activations of these feed-forward layers, known as the Activation Space. In this survey paper, we review interpretability methods that examine what is learned in this activation space. Since only limited research exists in this direction, we examine each work in detail and point out potential future research directions.…
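To make the term "activation space" concrete, here is a minimal sketch of a transformer-style feed-forward block that returns its intermediate neuron activations alongside its output. It is an illustration only, not a method from the survey: the dimensions `d_model` and `d_ff`, the random weights, the helper names, and the choice of ReLU (many models use GELU instead) are all assumptions made for this example.

```python
import random


def relu(x):
    # ReLU nonlinearity; an assumption for this sketch (GELU is also common).
    return max(0.0, x)


def linear(vec, weight, bias):
    # vec has n entries, weight has n rows and m columns, bias has m entries.
    return [
        sum(v * row[j] for v, row in zip(vec, weight)) + bias[j]
        for j in range(len(bias))
    ]


def feed_forward(token_vec, w_in, b_in, w_out, b_out):
    """Return (output, activations) for a single token vector.

    `activations` holds one value per feed-forward neuron; collected over
    many tokens, these vectors are the points of the activation space.
    """
    activations = [relu(z) for z in linear(token_vec, w_in, b_in)]
    output = linear(activations, w_out, b_out)
    return output, activations


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


# Hypothetical toy sizes; real models use far larger dimensions.
rng = random.Random(0)
d_model, d_ff = 4, 16
w_in, b_in = random_matrix(d_model, d_ff, rng), [0.0] * d_ff
w_out, b_out = random_matrix(d_ff, d_model, rng), [0.0] * d_model

# Three random stand-in "token" vectors instead of real embeddings.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(3)]
activation_space = [feed_forward(t, w_in, b_in, w_out, b_out)[1] for t in tokens]

# Indices of the three most strongly activated neurons for the first token.
top_neurons = sorted(
    range(d_ff), key=lambda j: activation_space[0][j], reverse=True
)[:3]
print(top_neurons)
```

Each entry of `activation_space` is a `d_ff`-dimensional vector for one token. Ranking neurons by activation, as in `top_neurons`, is one simple way to start inspecting such vectors; it is shown here only as an example of the kind of analysis this space permits.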
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Neural Networks and Applications
