Training Neural Networks for Modularity aids Interpretability

Satvik Golechha; Dylan Cope; Nandi Schoots

arXiv:2409.15747·cs.LG·July 29, 2025

TL;DR

This paper introduces an enmeshment loss that trains neural networks to be more modular. The loss encourages non-interacting clusters, which improves interpretability: on CIFAR-10, the clusters learn different, disjoint, and smaller circuits for the labels.

Contribution

The paper proposes an enmeshment loss function that encourages the formation of non-interacting clusters during training, so that a model can be split into disjoint clusters that are studied independently.

Findings

01 Models trained with the enmeshment loss form clusters that learn different, disjoint, and smaller circuits.

02 Automated interpretability measures are used to show this separation between circuits.

03 The method is demonstrated on CIFAR-10, with circuits identified per label.

Abstract

An approach to improve network interpretability is via clusterability, i.e., splitting a model into disjoint clusters that can be studied independently. We find pretrained models to be highly unclusterable and thus train models to be more modular using an "enmeshment loss" function that encourages the formation of non-interacting clusters. Using automated interpretability measures, we show that our method finds clusters that learn different, disjoint, and smaller circuits for CIFAR-10 labels. Our approach provides a promising direction for making neural networks easier to interpret.
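
The abstract does not define the enmeshment loss, so the following is only an illustrative sketch under an assumed definition, not the authors' formulation. It assumes each unit of a linear layer has a fixed cluster id and measures the fraction of squared weight mass on connections that cross clusters; adding that fraction to the task loss would discourage interaction between clusters. The names `enmeshment`, `regularized_loss`, and `lam` are hypothetical.

```python
import numpy as np


def enmeshment(weight, in_clusters, out_clusters):
    """Fraction of squared weight mass on cross-cluster connections.

    ASSUMPTION: this definition is illustrative, not taken from the paper.
    weight:       (n_out, n_in) weight matrix of one linear layer.
    in_clusters:  cluster id for each input unit, length n_in.
    out_clusters: cluster id for each output unit, length n_out.
    Returns a value in [0, 1]; 0 means no weight connects different clusters.
    """
    weight = np.asarray(weight, dtype=float)
    # same[i, j] is True when output unit i and input unit j share a cluster.
    same = np.asarray(out_clusters)[:, None] == np.asarray(in_clusters)[None, :]
    squared = weight ** 2
    total = squared.sum()
    if total == 0.0:
        return 0.0
    return float(squared[~same].sum() / total)


def regularized_loss(task_loss, layers, lam=1.0):
    """Task loss plus a weighted sum of per-layer enmeshment terms.

    layers: iterable of (weight, in_clusters, out_clusters) tuples.
    lam:    hypothetical trade-off coefficient.
    """
    penalty = sum(enmeshment(w, cin, cout) for w, cin, cout in layers)
    return task_loss + lam * penalty


# Two clusters of two units each on both sides of the layer.
clusters = [0, 0, 1, 1]
block_diagonal = np.kron(np.eye(2), np.ones((2, 2)))  # no cross-cluster weights
dense = np.ones((4, 4))                               # half the weights cross clusters
modular_score = enmeshment(block_diagonal, clusters, clusters)
dense_score = enmeshment(dense, clusters, clusters)
```

Under this assumed definition, a block-diagonal layer scores zero and a uniformly dense layer with two equal clusters scores one half, so minimizing the penalty pushes weight mass into within-cluster connections. In actual training the same quantity would be computed in an autodiff framework so that it contributes gradients.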

Taxonomy

Topics: Natural Language Processing Techniques