Probing the Embedding Space of Transformers via Minimal Token Perturbations
Eddie Conti, Alejandro Astruc, Alvaro Parafita, Axel Brando

TL;DR
This paper investigates how minimal token changes affect Transformer embeddings, revealing insights into information flow and layer-wise behavior, and proposing a new interpretability method based on token perturbations.
Contribution
The paper combines token perturbations with measurements of the resulting embedding shifts to study how Transformer models process their inputs.
Findings
Rare tokens cause larger embedding shifts.
Deeper layers show more intermixed information.
First layers can serve as proxies for explanations.
Abstract
Understanding how information propagates through Transformer models is a key challenge for interpretability. In this work, we study the effects of minimal token perturbations on the embedding space. In our experiments, we analyze which tokens most frequently yield minimal shifts, finding that rare tokens usually lead to larger shifts. Moreover, we study how perturbations propagate across layers, demonstrating that input information is increasingly intermixed in deeper layers. Our findings validate the common assumption that the first layers of a model can be used as proxies for model explanations. Overall, this work introduces the combination of token perturbations and shifts in the embedding space as a powerful tool for model interpretability.
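The two measurements described in the abstract (the substitution that shifts the embedding least, and how a single substitution spreads across positions from layer to layer) can be illustrated with a small sketch. This is not the authors' code or setup: the toy model, the random weights, the L2 norm as the shift measure, and the names `hidden_states`, `minimal_perturbation`, and `per_position_shift` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, LAYERS = 50, 16, 4

# Hypothetical toy model: random token embeddings followed by a stack of
# attention-like mixing layers. It stands in for a real Transformer and is
# NOT the architecture studied in the paper.
E = rng.normal(size=(VOCAB, DIM))
W = [rng.normal(size=(DIM, DIM)) / np.sqrt(DIM) for _ in range(LAYERS)]

def hidden_states(tokens):
    """Return per-layer states (index 0 = input embeddings), each (seq_len, DIM)."""
    h = E[np.asarray(tokens)]
    states = [h]
    for w in W:
        scores = h @ h.T / np.sqrt(DIM)
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=1, keepdims=True)      # row-wise softmax
        h = h + np.tanh(attn @ h @ w)                # residual update
        states.append(h)
    return states

def minimal_perturbation(tokens, pos, layer=-1):
    """Find the replacement token at `pos` with the smallest L2 shift at `layer`."""
    base = hidden_states(tokens)[layer]
    best_tok, best_shift = None, np.inf
    for cand in range(VOCAB):
        if cand == tokens[pos]:
            continue  # skip the identity substitution
        perturbed = list(tokens)
        perturbed[pos] = cand
        shift = np.linalg.norm(hidden_states(perturbed)[layer] - base)
        if shift < best_shift:
            best_tok, best_shift = cand, shift
    return best_tok, best_shift

def per_position_shift(tokens, pos, new_token):
    """L2 shift per layer and position, shape (LAYERS + 1, seq_len)."""
    perturbed = list(tokens)
    perturbed[pos] = new_token
    a, b = hidden_states(tokens), hidden_states(perturbed)
    return np.stack([np.linalg.norm(x - y, axis=1) for x, y in zip(a, b)])

tokens = [3, 17, 8, 42, 5]
tok, shift = minimal_perturbation(tokens, pos=2)
shifts = per_position_shift(tokens, 2, tok)
```

In this toy setting, row 0 of `shifts` is nonzero only at the perturbed position, while later rows are nonzero at every position, which mirrors the intermixing of input information in deeper layers that the abstract describes. With a real model, the per-layer hidden states would replace `hidden_states`, and token rarity would have to come from corpus frequency counts, which this sketch does not model.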
