Probing the Embedding Space of Transformers via Minimal Token Perturbations

Eddie Conti; Alejandro Astruc; Alvaro Parafita; Axel Brando

arXiv:2506.18011·cs.LG·June 24, 2025

TL;DR

This paper investigates how minimal token changes affect Transformer embeddings, revealing insights into information flow and layer-wise behavior, and proposing a new interpretability method based on token perturbations.

Contribution

The paper combines token perturbations with measurements of the resulting embedding shifts to study the inner workings of Transformer models.

Findings

01

Rare tokens cause larger embedding shifts.

02

Deeper layers show more intermixed information.

03

First layers can serve as proxies for explanations.

Abstract

Understanding how information propagates through Transformer models is a key challenge for interpretability. In this work, we study the effects of minimal token perturbations on the embedding space. In our experiments, we analyze how frequently tokens yield minimal shifts, highlighting that rare tokens usually lead to larger shifts. Moreover, we study how perturbations propagate across layers, demonstrating that input information is increasingly intermixed in deeper layers. Our findings validate the common assumption that the first layers of a model can be used as proxies for model explanations. Overall, this work introduces the combination of token perturbations and shifts in the embedding space as a powerful tool for model interpretability.
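To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy encoder with random weights in which each layer mixes positions through a mean-pooled context vector (a stand-in for attention), and it assumes the shift is measured as the Euclidean (Frobenius) norm of the difference between hidden states. The names `hidden_states`, `minimal_perturbation`, and `off_position_share` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def hidden_states(token_ids, emb, weights):
    """Toy encoder (assumption): returns hidden states of shape (L+1, T, d).

    Layer 0 is the raw token embedding. Every later layer adds a residual
    update that depends on the mean over positions, so information from one
    position leaks into all the others.
    """
    h = emb[np.asarray(token_ids)]                 # (T, d)
    states = [h]
    for w in weights:
        context = h.mean(axis=0, keepdims=True)    # (1, d), shared by all positions
        h = h + np.tanh((h + context) @ w)         # residual update
        states.append(h)
    return np.stack(states)

def minimal_perturbation(token_ids, pos, emb, weights, layer=-1):
    """Try every replacement token at `pos`; return the one with the smallest shift."""
    base = hidden_states(token_ids, emb, weights)[layer]
    best_token, best_shift = None, np.inf
    for cand in range(emb.shape[0]):
        if cand == token_ids[pos]:
            continue                               # skip the identity "perturbation"
        perturbed = list(token_ids)
        perturbed[pos] = cand
        shifted = hidden_states(perturbed, emb, weights)[layer]
        shift = float(np.linalg.norm(shifted - base))
        if shift < best_shift:
            best_token, best_shift = cand, shift
    return best_token, best_shift

def off_position_share(token_ids, pos, new_token, emb, weights):
    """Per layer, fraction of the squared shift found at positions other than `pos`."""
    perturbed = list(token_ids)
    perturbed[pos] = new_token
    diff = (hidden_states(perturbed, emb, weights)
            - hidden_states(token_ids, emb, weights))
    energy = (diff ** 2).sum(axis=-1)              # (L+1, T)
    total = energy.sum(axis=1)
    return (total - energy[:, pos]) / total

# Illustrative setup with made-up sizes and random parameters.
rng = np.random.default_rng(0)
vocab_size, dim, n_layers = 20, 8, 3
emb = rng.normal(size=(vocab_size, dim))
weights = [rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(n_layers)]

sentence = [3, 7, 1, 12]
token, shift = minimal_perturbation(sentence, pos=1, emb=emb, weights=weights)
shares = off_position_share(sentence, 1, token, emb, weights)
print("closest replacement:", token, "shift:", shift)
print("off-position share per layer:", shares)
```

In this toy setting the off-position share is zero at the embedding layer, because only the replaced position changes, and becomes positive once the layers start mixing positions. That mirrors, in a simplified way, the claim that input information is more intermixed in deeper layers; it says nothing about the magnitudes observed in real models.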
