Towards a unified and verified understanding of group-operation networks

Wilson Wu; Louis Jaburi; Jacob Drori; Jason Gross

arXiv:2410.07476·cs.LG·January 28, 2025

TL;DR

This paper uncovers new internal structures of neural networks trained on finite group operations, providing a unified explanation that improves understanding and verification of model performance, especially for symmetric groups.

Contribution

It introduces a comprehensive explanation of neural networks trained on group operations, unifying previous approaches and verifying model internals with improved efficiency and accuracy guarantees.

Findings

1. Models approximate equivariance in each input argument.
2. Explanation yields 3x faster accuracy guarantees than brute force.
3. Achieves >=95% accuracy bound for 45% of trained models.
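To make the "brute force" baseline in the findings concrete, here is a minimal sketch, assuming that brute force means evaluating the model on every one of the |G|^2 input pairs (14,400 pairs for S5). The names `brute_force_accuracy`, `perfect_model`, and `ignores_second_input` are hypothetical stand-ins and do not come from the paper; a real check would plug in the trained network's argmax prediction.

```python
from itertools import permutations

# All 120 elements of S5, each a tuple p where p[i] is the image of i.
ELEMENTS = list(permutations(range(5)))

def compose(p, q):
    """Return the product p*q, meaning: apply q first, then p."""
    return tuple(p[q[i]] for i in range(5))

def brute_force_accuracy(predict):
    """Fraction of all |G|^2 input pairs on which predict(x, y) equals x*y."""
    correct = 0
    for x in ELEMENTS:
        for y in ELEMENTS:
            if predict(x, y) == compose(x, y):
                correct += 1
    return correct / len(ELEMENTS) ** 2

# Hypothetical stand-ins for a trained network's argmax prediction.
def perfect_model(x, y):
    return compose(x, y)

def ignores_second_input(x, y):
    # Correct only when y is the identity, i.e. on 120 of the 14,400 pairs.
    return x

if __name__ == "__main__":
    print(brute_force_accuracy(perfect_model))
    print(brute_force_accuracy(ignores_second_input))
```

An exhaustive check like this gives an exact accuracy but costs one model evaluation per input pair. A compact proof, as summarized above, aims to certify a lower bound on the same quantity with less computation.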

Abstract

A recent line of work in mechanistic interpretability has focused on reverse-engineering the computation performed by neural networks trained on the binary operation of finite groups. We investigate the internals of one-hidden-layer neural networks trained on this task, revealing previously unidentified structure and producing a more complete description of such models in a step towards unifying the explanations of previous works (Chughtai et al., 2023; Stander et al., 2024). Notably, these models approximate equivariance in each input argument. We verify that our explanation applies to a large fraction of networks trained on this task by translating it into a compact proof of model performance, a quantitative evaluation of the extent to which we faithfully and concisely explain model internals. In the main text, we focus on the symmetric group S5. For models trained on this group, our…
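One way to read "equivariance in each input argument" is sketched below, under the assumption that it means: multiplying the first input on the left by g shifts the output logits by g, and multiplying the second input on the right by h shifts them by h. This formalization, and the helper names `left_equivariance_gap`, `right_equivariance_gap`, and `one_hot_logits`, are illustrative assumptions rather than the paper's definitions. The exact group operation satisfies both properties, so an idealized one-hot "model" has zero gap; a trained network would be expected to have a small but nonzero gap.

```python
from itertools import permutations

# All 120 elements of S5 as tuples, plus a lookup from element to index.
ELEMENTS = list(permutations(range(5)))
INDEX = {p: i for i, p in enumerate(ELEMENTS)}

def compose(p, q):
    """Return the product p*q, meaning: apply q first, then p."""
    return tuple(p[q[i]] for i in range(5))

def one_hot_logits(x, y):
    """Hypothetical stand-in for a network: logit 1 on x*y, 0 elsewhere."""
    out = [0.0] * len(ELEMENTS)
    out[INDEX[compose(x, y)]] = 1.0
    return out

def left_equivariance_gap(logits, g, x, y):
    """Max over z of |logits(g*x, y)[g*z] - logits(x, y)[z]|."""
    base = logits(x, y)
    moved = logits(compose(g, x), y)
    return max(abs(moved[INDEX[compose(g, z)]] - base[INDEX[z]])
               for z in ELEMENTS)

def right_equivariance_gap(logits, h, x, y):
    """Max over z of |logits(x, y*h)[z*h] - logits(x, y)[z]|."""
    base = logits(x, y)
    moved = logits(x, compose(y, h))
    return max(abs(moved[INDEX[compose(z, h)]] - base[INDEX[z]])
               for z in ELEMENTS)

if __name__ == "__main__":
    g, x, y = ELEMENTS[17], ELEMENTS[42], ELEMENTS[99]
    print(left_equivariance_gap(one_hot_logits, g, x, y))
    print(right_equivariance_gap(one_hot_logits, g, x, y))
```

Averaging these gaps over many choices of g, h, x, and y would give a rough numerical measure of how closely a given model matches this reading of the property.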

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. Compact proofs are a new way of supporting model interpretations and it seems like they could be interesting, since as the paper states, valid compact proofs can be generated from interpretations one is certain of.
2. For the half of the models that the rho-set interpretation works for, explaining how to reconcile the irrep sparsity and cosets interpretation is helpful.

Weaknesses

1. While compact proofs are an interesting way to approach interpretability, it's unclear whether they could be used to help interpret neural network solutions for datasets where no or limited explicit information is known about the distribution it was sampled from (e.g. any language task, CIFAR-10, etc.).
2. The fact that the compact proofs derived in this work only get approximately a 50% success rate is concerning, as it implies that the framework using rho-sets is possibly not general enough

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. The authors have found an elegant yet highly non-obvious algorithm for group composition in a single ReLU layer, that multiple prior papers missed. This is a valuable contribution to the literature.
   - Further, the provided explanation clarifies and adds useful context to observations made in prior work
2. The presented compact proofs are highly detailed and rigorous, and actually explicitly go through every key detail, rather than hand-waving annoying points.
3. It demonstrates compact pr

Weaknesses

1. This is only studied on $S_5$
2. The compact proof is only 3x faster than brute force
3. The paper is highly technical and at times quite difficult to follow, especially as it builds deeply on 3 prior papers! Though the authors have clearly made an effort to be clear, and this is an inherently complex work. This took me significantly longer than other reviews.
4. The link between finding a compact proof of a bound on accuracy, and verifying a mechanistic explanation, seems somewhat unclear

Reviewer 03 · Rating 8 · Confidence 3

Strengths

- The authors successfully reverse-engineer a neural network trained to perform group operations and provide a more complete explanation than prior work.
- They rigorously evaluate the quality of their explanation, highlighting that it only explains a subset of solutions a model with this architecture might learn in practice.
- Their evaluation exposes limitations of causal interventions as positive evidence of explanations.

Weaknesses

Overall, I think this is a solid contribution without significant weaknesses.

Videos

Towards a Unified and Verified Understanding of Group-Operation Networks · SlidesLive

Taxonomy

Topics: Complex Systems and Decision Making · Logic, Reasoning, and Knowledge · Semantic Web and Ontologies

Methods: Focus