Patterning: The Dual of Interpretability
George Wang, Daniel Murfet

TL;DR
This paper introduces patterning, a method for determining which training data modifications produce a desired internal structure in a model, demonstrated on a small language model and a synthetic parentheses balancing task.
Contribution
The paper poses the dual of the interpretability problem: rather than reverse-engineering the internal structure of a trained network, it starts from a desired form of generalization and determines the data intervention that produces it.
Findings
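Re-weighting training data along principal susceptibility directions can accelerate or delay the formation of internal structure, such as the induction circuit, in a small language model.
Patterning can select among multiple algorithms in a synthetic parentheses balancing task.
Inverting the linear response relationship between data shifts and observables yields the data intervention that steers a model toward a target internal configuration.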
Abstract
Mechanistic interpretability aims to understand how neural networks generalize beyond their training data by reverse-engineering their internal structures. We introduce patterning as the dual problem: given a desired form of generalization, determine what training data produces it. Our approach is based on susceptibilities, which measure how posterior expectation values of observables respond to infinitesimal shifts in the data distribution. Inverting this linear response relationship yields the data intervention that steers the model toward a target internal configuration. We demonstrate patterning in a small language model, showing that re-weighting training data along principal susceptibility directions can accelerate or delay the formation of structure, such as the induction circuit. In a synthetic parentheses balancing task where multiple algorithms achieve perfect training…
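The inversion step described in the abstract can be illustrated with a minimal sketch. It assumes the susceptibilities have already been estimated and collected into a matrix `chi`, whose entry `chi[i, j]` approximates how the expectation of observable `i` responds to a small up-weighting of data component `j`. The matrix values, the function name `patterning_intervention`, and the use of a least-squares solve are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def patterning_intervention(chi, target_shift):
    """Return the minimum-norm data re-weighting dh with chi @ dh ~= target_shift.

    chi          : (n_observables, n_data_components) susceptibility matrix (assumed given)
    target_shift : (n_observables,) desired change in observable expectations
    """
    chi = np.asarray(chi, dtype=float)
    target_shift = np.asarray(target_shift, dtype=float)
    # lstsq returns the minimum-norm least-squares solution of the linear response equation.
    dh, *_ = np.linalg.lstsq(chi, target_shift, rcond=None)
    return dh

# Hypothetical toy susceptibilities: 2 observables, 3 data components.
chi = np.array([[2.0, 0.0, 1.0],
                [0.0, 1.0, -1.0]])
target = np.array([1.0, -0.5])

dh = patterning_intervention(chi, target)
predicted = chi @ dh  # linear-response prediction of the resulting shift

# Principal susceptibility directions in data space, taken here as right singular vectors.
_, singular_values, vt = np.linalg.svd(chi, full_matrices=False)
principal_direction = vt[0]
```

In this toy setting `dh` is the smallest re-weighting, in Euclidean norm, whose predicted first-order effect matches `target`. The prediction is only linear, so it should be trusted for small interventions at most.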
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Machine Learning in Materials Science · Generative Adversarial Networks and Image Synthesis
