Accelerating Training with Neuron Interaction and Nowcasting Networks

Boris Knyazev; Abhinav Moudgil; Guillaume Lajoie; Eugene Belilovsky; Simon Lacoste-Julien

arXiv:2409.04434·cs.LG·March 3, 2025

TL;DR

This paper introduces NiNo networks that enhance weight nowcasting by leveraging neuron connectivity and graph neural networks, significantly accelerating neural network training in vision and language tasks.

Contribution

The paper proposes NiNo networks, improving upon WNNs by modeling neuron interactions with graph neural networks for more accurate parameter nowcasting.
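
One way to see why neuron connectivity matters: in a plain MLP, reordering the hidden neurons together with their incoming and outgoing weights leaves the network's function unchanged, so a flat weight vector is only one of many equivalent descriptions. The sketch below is a minimal pure-Python illustration of that general symmetry and of a generic neuron-as-node, weight-as-edge view. The helper names (`mlp`, `permute_hidden`, `as_edges`) and the toy weights are assumptions made for illustration; this is not the paper's graph construction or its GNN.

```python
import math


def mlp(x, W1, b1, W2, b2):
    """Two-layer ReLU MLP on plain Python lists (rows of W are output neurons)."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]


def permute_hidden(W1, b1, W2, perm):
    """Reorder hidden neurons: rows of W1, entries of b1, and columns of W2."""
    W1p = [W1[i] for i in perm]
    b1p = [b1[i] for i in perm]
    W2p = [[row[i] for i in perm] for row in W2]
    return W1p, b1p, W2p


def as_edges(W1, W2):
    """Generic graph view (an assumption): neurons are nodes, weights are edges."""
    edges = []
    for i, row in enumerate(W1):
        for j, w in enumerate(row):
            edges.append((("in", j), ("hid", i), w))
    for k, row in enumerate(W2):
        for i, w in enumerate(row):
            edges.append((("hid", i), ("out", k), w))
    return edges


def same_output(a, b):
    """Compare two output vectors up to floating-point rounding."""
    return len(a) == len(b) and all(math.isclose(p, q) for p, q in zip(a, b))


# Toy weights, chosen only for illustration.
W1 = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]
b1 = [0.0, 1.0, -0.5]
W2 = [[2.0, -1.0, 0.5]]
b2 = [0.25]
x = [1.5, -0.5]

W1p, b1p, W2p = permute_hidden(W1, b1, W2, perm=[2, 0, 1])
original = mlp(x, W1, b1, W2, b2)
permuted = mlp(x, W1p, b1p, W2p, b2)
```

Here `same_output(original, permuted)` holds even though the flattened weights differ. A model that treats weights as edges of a graph over neurons can respect this symmetry, whereas one that treats them as an ordered vector cannot.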

Findings

1. NiNo accelerates training by up to 50% in vision tasks.
2. Neuron connectivity modeling improves nowcasting accuracy.
3. NiNo outperforms previous methods such as WNNs in various tasks.

Abstract

Neural network training can be accelerated when a learnable update rule is used in lieu of classic adaptive optimizers (e.g. Adam). However, learnable update rules can be costly and unstable to train and use. Recently, Jang et al. (2023) proposed a simpler approach to accelerate training based on weight nowcaster networks (WNNs). In their approach, Adam is used for most of the optimization steps and periodically, only every few steps, a WNN nowcasts (predicts near future) parameters. We improve WNNs by proposing neuron interaction and nowcasting (NiNo) networks. In contrast to WNNs, NiNo leverages neuron connectivity and graph neural networks to more accurately nowcast parameters. We further show that in some networks, such as Transformers, modeling neuron connectivity accurately is challenging. We address this and other limitations, which allows NiNo to accelerate Adam training by up…
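
The schedule described in the abstract (Adam for most steps, with a nowcast every few steps) can be sketched as a toy loop. This is a sketch under stated assumptions rather than the authors' implementation: `placeholder_nowcast` is a hypothetical stand-in that linearly extrapolates the last two checkpoints, whereas WNNs and NiNo are learned models. The quadratic loss, the `period` and `horizon` values, and the choice to keep Adam's moment estimates unchanged across the jump are also assumptions.

```python
import math

COEFFS = [1.0, 10.0]  # toy quadratic: loss(w) = 0.5 * sum(c_i * w_i**2)


def loss(w):
    return 0.5 * sum(c * x * x for c, x in zip(COEFFS, w))


def grad(w):
    return [c * x for c, x in zip(COEFFS, w)]


def adam_step(w, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One standard Adam update with bias correction (t starts at 1)."""
    g = grad(w)
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
    w = [
        wi - lr * (mi / (1 - b1 ** t)) / (math.sqrt(vi / (1 - b2 ** t)) + eps)
        for wi, mi, vi in zip(w, m, v)
    ]
    return w, m, v


def placeholder_nowcast(history, horizon):
    """Hypothetical stand-in for a learned nowcaster: extrapolate the last
    observed step `horizon` steps ahead. Not a WNN and not NiNo."""
    prev, last = history[-2], history[-1]
    return [l + horizon * (l - p) for p, l in zip(prev, last)]


def train(steps=200, period=10, horizon=5, use_nowcast=True):
    w, m, v = [5.0, -3.0], [0.0, 0.0], [0.0, 0.0]
    history = [list(w)]  # recent checkpoints fed to the nowcaster
    for t in range(1, steps + 1):
        w, m, v = adam_step(w, m, v, t)
        history.append(list(w))
        if use_nowcast and t % period == 0:
            # Periodic jump to predicted near-future parameters.
            # Assumption: Adam's m and v are left untouched.
            w = placeholder_nowcast(history, horizon)
            history = [list(w)]
    return w, loss(w)


w_nowcast, loss_nowcast = train(use_nowcast=True)
w_adam, loss_adam = train(use_nowcast=False)
```

Comparing `loss_nowcast` with `loss_adam` on this toy problem says nothing about real speedups. The point is only the control flow: the base optimizer runs unchanged, and the nowcaster occasionally replaces the current parameters with a prediction made from recent checkpoints.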

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

I think the construction of the graph for a transformer layer, in which isomorphic permutations of nodes preserve the model functionality, is neat and interesting.

Weaknesses

I have two major concerns about the paper:
1. My main concern is that the experimental section contains only fairly trivial datasets (FashionMNIST, CIFAR-10), which are very far from anything reasonable these days, and the models the authors consider for forecasting are limited to ~1M parameters, and many are 15K parameters, which is barely practical for the simplest tasks. I think for image tasks, showing reasonable performance on something like ImageNet is a must. On the other hand authors run their

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The paper is sufficiently well written and is fairly accessible.
2. The proposed approach is sufficiently sound and novel. Even though it can be seen as a combination of two existing techniques (WNNs and an improved GNN model weight representation), this paper still contains a number of non-trivial innovations. For example, among other things, the authors make a number of logical steps improving on a previously published graph topology for multi-headed self-attention.
3. Experimental results

Weaknesses

1. Some discussions could perhaps be improved upon to be even more clear. For example, while being sufficiently understandable, Section 4.1 could still be clarified further. Figure 2 is also difficult to interpret in its current form. Color coding takes time to digest.
2. The training method is fairly computationally expensive as the authors collect on the order of $10^6$ checkpoints. To be practical, this initial computational investment should be compensated by the future computational wins fr

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The paper is well-motivated. It is an improved version of WNN by integrating GNN.
2. The experiments cover different tasks including language modeling and image classification tasks.

Weaknesses

1. It is unclear if training multiple models during meta-training is practical for real-world applications, where typically only a limited number of models are trained.
2. The generalization performance of NiNo should be further tested. The largest test case is 100M-parameter models on a small dataset like WikiText-103. It may not fully represent NiNo's capabilities in broader applications.

Code & Models

Repositories

samsungsailmontreal/nino · pytorch · Official

Datasets

SamsungSAILMontreal/nino_metatrain
dataset · 853 downloads

Videos

Accelerating Training with Neuron Interaction and Nowcasting Networks · SlidesLive

Taxonomy

Topics · Neural Networks and Applications

Methods · Adam