Beyond Decodability: Reconstructing Language Model Representations with an Encoding Probe
Gaofei Shen, Martijn Bentum, Tom Lentz, Afra Alishahi, Grzegorz Chrupała

TL;DR
This paper introduces an Encoding Probe method to reconstruct language model internal representations using interpretable features, addressing limitations of traditional decoding probes.
Contribution
The paper proposes a novel encoding probe approach that enables direct comparison of feature contributions and accounts for feature correlations in model representations.
Findings
Speaker effects vary across training objectives and datasets.
Syntactic and lexical features contribute independently to reconstruction.
Encoding Probe offers a complementary perspective to decoding methods.
Abstract
Probing is widely used to study which features can be decoded from language model representations. However, the common decoding probe approach has two limitations that we aim to solve with our new encoding probe approach: contributions of different features to model representations cannot be directly compared, and feature correlations can affect probing results. We present an Encoding Probe that reverses this direction and reconstructs internal representations of models using interpretable features. We evaluate this method on text and speech transformer models, using feature sets spanning acoustics, phonetics, syntax, lexicon, and speaker identity. Our results suggest that speaker-related effects vary strongly across different training objectives and datasets, while syntactic and lexical features contribute independently to reconstruction. These results show that the Encoding Probe…
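As a rough illustration of the idea described in the abstract, the sketch below fits a regression from interpretable features to (synthetic) model representations and measures how much held-out reconstruction quality drops when each feature group is left out. This is only a sketch under stated assumptions: the listing does not say which regression model or contribution measure the authors use, so closed-form ridge regression, the leave-one-group-out R² drop, the function names (`fit_ridge`, `encoding_probe`), the group names, and all of the data are hypothetical choices made for this example.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)


def r2(Y, Y_hat):
    """Variance-weighted R^2 pooled over all representation dimensions."""
    ss_res = ((Y - Y_hat) ** 2).sum()
    ss_tot = ((Y - Y.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot


def encoding_probe(groups, Y, alpha=1.0, train_frac=0.8):
    """Reconstruct representations Y (n, h) from named feature groups.

    groups maps a (hypothetical) group name to an (n, d_g) feature matrix.
    Returns the held-out R^2 of the full model and, per group, the R^2 drop
    when that group is removed (an assumed contribution measure).
    Requires at least two groups so that every ablation keeps some features.
    """
    n_train = int(Y.shape[0] * train_frac)

    def score(names):
        X = np.hstack([groups[g] for g in names])
        # Standardize features with statistics from the training split only.
        mu = X[:n_train].mean(axis=0)
        sd = X[:n_train].std(axis=0) + 1e-8
        X = (X - mu) / sd
        y_mu = Y[:n_train].mean(axis=0)
        W = fit_ridge(X[:n_train], Y[:n_train] - y_mu, alpha)
        return r2(Y[n_train:], X[n_train:] @ W + y_mu)

    full = score(list(groups))
    drops = {g: full - score([h for h in groups if h != g]) for g in groups}
    return full, drops


# Synthetic demo: representations depend on "lexical" and "syntactic"
# features but not on "speaker" features (all data here is made up).
rng = np.random.default_rng(0)
n = 500
groups = {
    "lexical": rng.normal(size=(n, 5)),
    "syntactic": rng.normal(size=(n, 3)),
    "speaker": rng.normal(size=(n, 2)),
}
Y = (
    groups["lexical"] @ rng.normal(size=(5, 16))
    + groups["syntactic"] @ rng.normal(size=(3, 16))
    + 0.1 * rng.normal(size=(n, 16))
)

full_r2, drops = encoding_probe(groups, Y)
print(f"full R^2: {full_r2:.3f}")
for name, drop in drops.items():
    print(f"R^2 drop without {name}: {drop:.3f}")
```

Because every feature group enters the same regression and is scored on the same R² scale, the drops can be compared across groups, and a group that merely correlates with another contributes little once the other is present. In this toy setup the "speaker" drop should sit near zero, since those features play no role in generating `Y`.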
