Toy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space
Alon Bebchuk, Nir Shavit

TL;DR
Using a combinatorial toy model with an interpretable feature space, this paper shows that winning lottery tickets correspond to feature-space locations that already lie near the final codes at initialization, emphasizing feature-space geometry over weight-space subnetwork identity.
Contribution
The paper demonstrates that winning tickets are linked to feature-space locations that lie near the final codes at initialization, highlighting the role of feature-space geometry in lottery ticket phenomena.
Findings
Winning tickets correspond to precursor locations in feature space.
Proximal locations either converge to final codes or are rejected.
Feature-space probes outperform weight-based methods in code recovery.
Abstract
The lottery ticket hypothesis posits that dense networks contain sparse subnetworks, "winning tickets," that, when rewound to their initial weights and retrained in isolation, match the performance of the full model. We ask a more mechanistic question: what internal object does a winning ticket preserve? We work in a combinatorial, clause-structured toy setting that admits an interpretable feature-space representation with well-defined combinatorial distances between features. We show that winning tickets in weight space correspond to precursor locations in feature space that, at initialization, are already near the final feature-channel codes. Dense SGD resolves these locations through structured selection: proximal locations either converge to final codes or are rejected, with rejection concentrated at more crowded neurons, implicating competition under superposition. A winning…
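The rewind step in the hypothesis statement can be illustrated with a minimal sketch. It assumes one-shot global magnitude pruning, a common way to select tickets; the abstract does not say which pruning method the paper uses, and the function names `magnitude_mask` and `rewind` are hypothetical.

```python
import numpy as np


def magnitude_mask(trained_weights, sparsity):
    """Boolean mask keeping the largest-magnitude (1 - sparsity) fraction.

    Assumption: one-shot global magnitude pruning, not necessarily the
    paper's own selection procedure.
    """
    flat = np.abs(trained_weights).ravel()
    n_keep = int(round(flat.size * (1.0 - sparsity)))
    mask = np.zeros(flat.size, dtype=bool)
    if n_keep > 0:
        # argpartition places the n_keep largest magnitudes in the tail.
        kth = flat.size - n_keep
        keep_idx = np.argpartition(flat, kth)[kth:]
        mask[keep_idx] = True
    return mask.reshape(trained_weights.shape)


def rewind(initial_weights, mask):
    """Candidate ticket: surviving weights reset to their initial values."""
    return initial_weights * mask


# Toy example: the mask comes from the trained weights,
# but the ticket's values come from the initialization.
w_init = np.array([[0.1, -0.2], [0.3, -0.4]])
w_trained = np.array([[0.05, -1.0], [2.0, -0.01]])
mask = magnitude_mask(w_trained, sparsity=0.5)
ticket = rewind(w_init, mask)
```

Retraining `ticket` with the mask held fixed, then comparing its performance against the dense model, is what decides whether it counts as a winning ticket.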
