TL;DR
TRACE is a modular toolkit for in-training interpretability analysis of transformer language models. It tracks when linguistic knowledge and representational structure emerge over the course of training.
Contribution
The paper introduces a lightweight, in-training interpretability toolkit that integrates with a synthetic corpus generator (ABSynth) to analyze how language models acquire linguistic features.
Findings
Reveals early syntactic emergence in training
Detects delayed semantic acquisition
Identifies representational compression phenomena
Abstract
Understanding when and how linguistic knowledge emerges during language model training remains a central challenge for interpretability. Most existing tools are post hoc, rely on scalar metrics, or require nontrivial integration effort, making comprehensive interpretability analysis difficult to deploy and maintain. We introduce TRACE, a modular toolkit for training-time and inference-time interpretability analysis of transformer models. It enables lightweight, in-training analysis of linguistic and representational signals, including feature probing, intrinsic dimensionality, Hessian curvature, and output diagnostics. It integrates with ABSynth, a controllable synthetic corpus generator that provides structured annotations for precise evaluation of linguistic feature acquisition. Experiments with autoregressive transformers demonstrate that TRACE reveals developmental phenomena such as…