Perturbation: A simple and efficient adversarial tracer for representation learning in language models
Joshua Rozner, Cory Shain

TL;DR
This paper introduces a simple adversarial perturbation method that fine-tunes language models on a single example to reveal how representations transfer across examples, uncovering structured linguistic abstractions without geometric assumptions.
Contribution
The paper proposes a perturbation-based approach to analyzing linguistic representations in language models, one that avoids geometric assumptions and reveals structured transfer of information across examples.
Findings
Perturbation reveals structured transfer at multiple linguistic levels.
Language models generalize along representational lines.
Models acquire linguistic abstractions from training data.
Abstract
Linguistic representation learning in deep neural language models (LMs) has been studied for decades, for both practical and theoretical reasons. However, finding representations in LMs remains an unsolved problem, in part due to a dilemma between enforcing implausible constraints on representations (e.g., linearity; Arora et al., 2024) and trivializing the notion of representation altogether (Sutter et al., 2025). Here we escape this dilemma by reconceptualizing representations not as patterns of activation but as conduits for learning. Our approach is simple: we perturb an LM by fine-tuning it on a single adversarial example and measure how this perturbation "infects" other examples. Perturbation makes no geometric assumptions, and unlike other methods, it does not find representations where it should not (e.g., in untrained LMs). But in trained LMs, perturbation reveals structured…
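The perturb-then-measure idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation and does not use a language model: it swaps in a two-weight logistic classifier over hand-built features so that the logic is easy to follow. The feature names, the example words, the learning rate, and the "infection" score (change in log-probability of the perturbed target) are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_prob(weights, features, label):
    """Log-probability the toy model assigns to `label` (0 or 1)."""
    p = sigmoid(sum(w * f for w, f in zip(weights, features)))
    return math.log(p if label == 1 else 1.0 - p)


def perturb(weights, features, label, lr=1.0):
    """One gradient-ascent step on the log-likelihood of a single example."""
    p = sigmoid(sum(w * f for w, f in zip(weights, features)))
    return [w + lr * (label - p) * f for w, f in zip(weights, features)]


def infection(weights, perturbed, features, label):
    """Change in log-probability of `label` caused by the perturbation."""
    return log_prob(perturbed, features, label) - log_prob(weights, features, label)


# Hypothetical hand-built features: [is_animate, is_plural].
examples = {
    "dog": [1.0, 0.0],
    "cats": [1.0, 1.0],
    "rocks": [0.0, 1.0],
}

weights = [0.0, 0.0]  # stand-in for the model before perturbation

# Perturb on a single example ("dog") with an unexpected target label.
perturbed = perturb(weights, examples["dog"], label=1)

# Measure how far the perturbation spreads to every example.
scores = {
    name: infection(weights, perturbed, feats, label=1)
    for name, feats in examples.items()
}

for name, score in scores.items():
    print(f"{name}: {score:+.4f}")
```

In this toy setting the update reaches "cats", which shares the animate feature with "dog", and leaves "rocks" unchanged, because the two share no active feature. In a real LM the shared structure is not given by hand; the measured transfer pattern is what is used to infer it.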
Taxonomy
Topics: Neurobiology of Language and Bilingualism · Topic Modeling · Natural Language Processing Techniques
