Gradient Atoms: Unsupervised Discovery, Attribution and Steering of Model Behaviors via Sparse Decomposition of Training Gradients
J Rosser

TL;DR
Gradient Atoms is an unsupervised technique that decomposes training gradients into sparse components, revealing shared behaviors and enabling controllable model steering without predefined queries.
Contribution
It introduces an unsupervised gradient decomposition method that discovers interpretable model behaviors and provides a way to steer models by applying identified gradient atoms.
Findings
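Discovered 500 gradient atoms; the highest-coherence ones recover interpretable task-type behaviors such as refusal
Identified atoms can be applied to the model to steer its outputs
The method needs no per-behavior query, so its cost does not depend on the number of behaviors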
Abstract
Training data attribution (TDA) methods ask which training documents are responsible for a model behavior. However, models often learn broad concepts shared across many examples. Moreover, existing TDA methods are supervised -- they require a predefined query behavior, then score every training document against it -- making them both expensive and unable to surface behaviors the user did not think to ask about. We present Gradient Atoms, an unsupervised method that decomposes per-document training gradients into sparse components ("atoms") via dictionary learning in a preconditioned eigenspace. Each atom captures a shared update direction induced by a cluster of functionally similar documents, directly recovering the collective structure that per-document methods do not address. Among 500 discovered atoms, the highest-coherence ones recover interpretable task-type behaviors -- refusal,…
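To make the pipeline in the abstract concrete, here is a minimal NumPy sketch on synthetic data. It is not the paper's implementation. The function names, the use of the empirical gradient second-moment matrix as the preconditioner, the damping value, the ISTA sparse coder, the least-squares dictionary update, and the steering step size are all assumptions chosen for illustration.

```python
import numpy as np

def preconditioned_eigenspace(G, k, damping=1e-3):
    """Project per-document gradients G (n_docs, d) onto the top-k eigenvectors
    of G^T G / n and rescale each coordinate by 1/sqrt(eigenvalue + damping).
    Using G^T G / n as the curvature proxy is an assumption of this sketch."""
    n = G.shape[0]
    C = G.T @ G / n
    evals, evecs = np.linalg.eigh(C)          # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:k]         # indices of the k largest
    V = evecs[:, idx]                         # (d, k)
    scale = 1.0 / np.sqrt(evals[idx] + damping)
    return (G @ V) * scale, V, scale

def soft_threshold(Z, t):
    """Elementwise shrinkage operator used by ISTA."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def sparse_codes(X, D, alpha, n_steps=100):
    """ISTA for min_A 0.5 * ||X - A D||_F^2 + alpha * ||A||_1."""
    L = np.linalg.norm(D @ D.T, 2) + 1e-12    # Lipschitz constant of the gradient
    A = np.zeros((X.shape[0], D.shape[0]))
    for _ in range(n_steps):
        grad = (A @ D - X) @ D.T
        A = soft_threshold(A - grad / L, alpha / L)
    return A

def learn_atoms(X, n_atoms, alpha=0.1, n_iters=20, seed=0):
    """Alternate sparse coding with a least-squares dictionary update.
    Rows of D are the atoms; rows of A are per-document sparse codes."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((n_atoms, X.shape[1]))
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    for _ in range(n_iters):
        A = sparse_codes(X, D, alpha)
        D = np.linalg.lstsq(A, X, rcond=None)[0]
        D /= np.maximum(np.linalg.norm(D, axis=1, keepdims=True), 1e-12)
    return D, sparse_codes(X, D, alpha)

def atom_to_parameter_space(atom, V, scale):
    """Map an atom from the rescaled eigenspace back to a parameter-space
    direction (hypothetical helper; the unprojection rule is an assumption)."""
    return V @ (atom * scale)

# Synthetic stand-in for per-document gradients: two planted shared directions.
rng = np.random.default_rng(0)
n_docs, d = 60, 20
directions = rng.standard_normal((2, d))
membership = np.zeros((n_docs, 2))
membership[:30, 0] = 1.0
membership[30:, 1] = 1.0
G = membership @ directions + 0.05 * rng.standard_normal((n_docs, d))

X, V, scale = preconditioned_eigenspace(G, k=5)
D, A = learn_atoms(X, n_atoms=8)

# Pick the most heavily used atom and take a small step along it.
top_atom = int(np.argmax(np.abs(A).sum(axis=0)))
delta = atom_to_parameter_space(D[top_atom], V, scale)
theta = np.zeros(d)                           # placeholder parameter vector
theta_steered = theta + 0.1 * delta           # step size 0.1 is arbitrary
```

In this sketch, the documents with large coefficients in a given column of `A` form the cluster attributed to that atom, and adding or subtracting `delta` is one plausible way to steer along it. A real model would need gradients from actual training documents and a tractable curvature approximation in place of the dense `d × d` matrix used here.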
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Domain Adaptation and Few-Shot Learning
