Equivariant Neural Functional Networks for Transformers

Viet-Hoang Tran; Thieu N. Vo; An Nguyen The; Tho Tran Huu; Minh-Khoi Nguyen-Nhat; Thanh Tran; Duy-Tung Pham; Tan Minh Nguyen

arXiv:2410.04209·cs.LG·March 10, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a systematic approach to designing neural functional networks for transformers, establishing their symmetry properties, and providing a new benchmark dataset for evaluating such models.

Contribution

The paper develops the first equivariant NFN for transformers, analyzes the symmetry group of multi-head attention weights, and releases a large dataset of transformer checkpoints for benchmarking.

Findings

01

Transformer-NFN is equivariant under identified group actions.

02

A new dataset of 125,000 transformer checkpoints is provided.

03

Guidelines for NFN design in transformers are established.

Abstract

This paper systematically explores neural functional networks (NFN) for transformer architectures. NFN are specialized neural networks that treat the weights, gradients, or sparsity patterns of a deep neural network (DNN) as input data and have proven valuable for tasks such as learnable optimizers, implicit data representations, and weight editing. While NFN have been extensively developed for MLP and CNN, no prior work has addressed their design for transformers, despite the importance of transformers in modern deep learning. This paper aims to address this gap by providing a systematic study of NFN for transformers. We first determine the maximal symmetric group of the weights in a multi-head attention module as well as a necessary and sufficient condition under which two sets of hyperparameters of the multi-head attention module define the same function. We then define the weight…
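To make the notion of weight symmetry concrete, here is a minimal NumPy sketch of one well-known family of transformations that leave a multi-head attention function unchanged: an invertible change of basis inside each head's query/key pair, another inside its value/output pair, and a permutation of the heads. This is an illustration only. The abstract above does not spell out the paper's symmetry group, so the sketch should not be read as the authors' exact construction. All names (`multi_head_attention`, `transform_heads`, `random_invertible`), the weight shapes, and the `1/sqrt(d_k)` scaling are assumptions made for the example.

```python
import numpy as np


def softmax_rows(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


def multi_head_attention(x, heads):
    """x: (seq_len, d_model). heads: list of (W_q, W_k, W_v, W_o), one per head.

    Assumed shapes: W_q, W_k: (d_model, d_k); W_v: (d_model, d_v);
    W_o: (d_v, d_model). Summing the per-head outputs is equivalent to
    concatenating heads and applying one stacked output matrix.
    """
    out = np.zeros_like(x)
    for W_q, W_k, W_v, W_o in heads:
        d_k = W_q.shape[1]
        attn = softmax_rows((x @ W_q) @ (x @ W_k).T / np.sqrt(d_k))
        out = out + attn @ (x @ W_v) @ W_o
    return out


def random_invertible(dim, rng):
    """Hypothetical helper: a well-conditioned, non-orthogonal invertible matrix."""
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q @ np.diag(rng.uniform(0.5, 2.0, size=dim))


def transform_heads(heads, rng):
    """Apply a per-head change of basis, then shuffle the head order."""
    new_heads = []
    for W_q, W_k, W_v, W_o in heads:
        M = random_invertible(W_q.shape[1], rng)  # acts on the query/key pair
        N = random_invertible(W_v.shape[1], rng)  # acts on the value/output pair
        new_heads.append((
            W_q @ M,                   # (x W_q M)(x W_k M^{-T})^T = x W_q W_k^T x^T
            W_k @ np.linalg.inv(M).T,
            W_v @ N,                   # x W_v N N^{-1} W_o = x W_v W_o
            np.linalg.inv(N) @ W_o,
        ))
    order = rng.permutation(len(new_heads))
    return [new_heads[i] for i in order]


rng = np.random.default_rng(0)
seq_len, d_model, d_k, d_v, n_heads = 5, 8, 4, 4, 2
x = rng.normal(size=(seq_len, d_model))
heads = [
    (
        rng.normal(size=(d_model, d_k)),
        rng.normal(size=(d_model, d_k)),
        rng.normal(size=(d_model, d_v)),
        rng.normal(size=(d_v, d_model)),
    )
    for _ in range(n_heads)
]
new_heads = transform_heads(heads, rng)

out_a = multi_head_attention(x, heads)
out_b = multi_head_attention(x, new_heads)
print("max abs difference:", np.abs(out_a - out_b).max())
```

The two weight sets differ entry by entry yet produce the same outputs up to floating-point error. An NFN that takes such weights as input would ideally treat both sets consistently, which is the motivation for building equivariance or invariance into its layers.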

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- This is the first paper, to my knowledge, to explore neural functional networks (NFN) in the context of transformers, where NFN treat the weights, gradients, or sparsity patterns of a deep neural network (DNN) as input data.
- The proposed Small Transformer Zoo dataset is a valuable resource for benchmarking and studying transformer-based networks, promoting reproducibility and further research.

Weaknesses

- The paper does not discuss the computational overhead of the equivariant polynomial layers or whether that overhead could limit scalability to larger transformer models.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

The paper provides a very structured approach to the problem of defining NFNs for transformers and is rigorous in its approach. Some of the mathematical formulation around the weight spaces of multi-head attention and the solution for parameter sharing in the NFN is good.

Weaknesses

The paper is a bit dense to read and understand. Though this can be considered fine given the topic and detailed approach. There could've been a better balance at analyzing the different theorems presented vs covering the mathematical proof. The experimentation is basic and doesn't necessarily capture the strengths of the method or even transformers. Transformers are known to generalize over large model sizes and large data and it is the biggest weakness of the paper on how this method would ho

Reviewer 03 · Rating 6 · Confidence 2

Strengths

1. The paper presents a novel theoretical framework for designing Neural Functional Networks that respects the inherent symmetries of Transformer architectures, representing an innovative approach previously unexplored in this domain.
2. The release of a dataset with 125,000 Transformer model checkpoints offers a valuable resource for future research on Transformer performance prediction and NFNs.
3. The paper provides a detailed methodological development, including thorough derivations and des

Weaknesses

1. While the use of equivariant layers in Transformer-NFN is theoretically compelling, empirical results suggest limited practical impact. Specifically, the marginal gains in predictive accuracy on the benchmark dataset raise questions about the method’s utility in real-world applications.
2. The architecture's intricate structure, requiring a specialized understanding of group actions and equivariant layers, could hinder adoption. The dense theoretical foundation, coupled with limited practical

Taxonomy

Topics: Neural Networks and Applications

Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Neo-fuzzy-neuron