Training Neural Networks for Modularity aids Interpretability
Satvik Golechha, Dylan Cope, Nandi Schoots

TL;DR
This paper introduces an enmeshment loss that trains neural networks to be modular. The trained network splits into non-interacting clusters that learn distinct, smaller circuits, which makes it easier to interpret. Results are demonstrated on CIFAR-10.
Contribution
The paper proposes an enmeshment loss function that promotes modularity in neural networks by encouraging the formation of non-interacting clusters, so that each cluster can be studied independently.
Findings
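Pretrained models are found to be highly unclusterable, which motivates training for modularity directly.
Clusters trained with the enmeshment loss learn different, disjoint, and smaller circuits for CIFAR-10 labels.
Automated interpretability measures are used to show this separation between clusters.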
Abstract
An approach to improve network interpretability is via clusterability, i.e., splitting a model into disjoint clusters that can be studied independently. We find pretrained models to be highly unclusterable and thus train models to be more modular using an "enmeshment loss" function that encourages the formation of non-interacting clusters. Using automated interpretability measures, we show that our method finds clusters that learn different, disjoint, and smaller circuits for CIFAR-10 labels. Our approach provides a promising direction for making neural networks easier to interpret.
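The abstract does not give the form of the enmeshment loss. As a rough illustration of what a penalty that "encourages the formation of non-interacting clusters" might look like, the sketch below assumes each neuron in a layer has a fixed cluster id and measures the fraction of squared weight mass connecting neurons in different clusters. The function name `cross_cluster_penalty`, the cluster assignments, and the normalization are all hypothetical choices for illustration, not the authors' definition.

```python
import numpy as np

def cross_cluster_penalty(weight, in_clusters, out_clusters):
    """Fraction of squared weight mass linking different clusters.

    Hypothetical illustration only; not the paper's enmeshment loss.
    weight:       array of shape (n_out, n_in)
    in_clusters:  int array of shape (n_in,), cluster id per input neuron
    out_clusters: int array of shape (n_out,), cluster id per output neuron
    """
    weight = np.asarray(weight, dtype=float)
    # mask[i, j] is True when output neuron i and input neuron j
    # belong to different clusters.
    mask = np.asarray(out_clusters)[:, None] != np.asarray(in_clusters)[None, :]
    squared = weight ** 2
    total = squared.sum()
    if total == 0.0:
        return 0.0
    return float(squared[mask].sum() / total)

# Two clusters of two neurons each (assumed assignment).
clusters = np.array([0, 0, 1, 1])

# Block-diagonal weights: the two clusters never interact.
modular = np.array([[1.0, 2.0, 0.0, 0.0],
                    [3.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 2.0, 1.0],
                    [0.0, 0.0, 1.0, 2.0]])

# Dense weights: every neuron connects to every other neuron.
dense = np.ones((4, 4))

penalty_modular = cross_cluster_penalty(modular, clusters, clusters)
penalty_dense = cross_cluster_penalty(dense, clusters, clusters)
```

Under these assumptions the block-diagonal layer scores zero and the dense layer scores higher, so adding such a term to the task loss would push weights toward a cluster-separated structure. In training, the same quantity would need to be written in a differentiable framework rather than NumPy.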
Taxonomy
Topics: Natural Language Processing Techniques
