Recurrent Diffusion for Large-Scale Parameter Generation

Kai Wang; Dongwen Tang; Wangbo Zhao; Konstantin Schürholt; Zhangyang Wang; and Yang You

arXiv:2501.11587·cs.LG·February 12, 2025

TL;DR

Recurrent Diffusion for Large-Scale Parameter Generation (RPG) is a framework that generates full neural network parameters, up to hundreds of millions, on a single GPU, across a range of architectures and tasks.

Contribution

The paper introduces RPG, a recurrent diffusion-based method that scales parameter generation to hundreds of millions of parameters, avoiding the memory limits that constrained earlier approaches.

Findings

01. RPG achieves performance comparable to fully trained networks.
02. It generalizes to unseen tasks beyond its training set.
03. It enables large-scale parameter generation on a single GPU.

Abstract

Parameter generation has long struggled to match the scale of today's large vision and language models, curbing its broader utility. In this paper, we introduce Recurrent Diffusion for Large-Scale Parameter Generation (RPG), a novel framework that generates full neural network parameters, up to hundreds of millions, on a single GPU. Our approach first partitions a network's parameters into non-overlapping tokens, each corresponding to a distinct portion of the model. A recurrent mechanism then learns the inter-token relationships, producing prototypes that serve as conditions for a diffusion process that ultimately synthesizes the full parameters. Across a spectrum of architectures and tasks, including ResNets, ConvNeXts, and ViTs on ImageNet-1K and COCO, and even LoRA-based LLMs, RPG achieves performance on par with fully trained networks while avoiding excessive memory overhead. Notably, it…
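The first step in the abstract, partitioning a network's parameters into non-overlapping tokens, can be illustrated with a minimal sketch. The function names `tokenize` and `detokenize`, the zero padding, the fixed `token_size`, and the toy parameter dictionary are all assumptions made for illustration; the abstract does not specify the authors' actual tokenization scheme, and the recurrent and diffusion stages are not shown here.

```python
from typing import Dict, List, Tuple

Layout = List[Tuple[str, int]]  # (parameter name, number of values) in order


def tokenize(params: Dict[str, List[float]], token_size: int) -> Tuple[List[List[float]], Layout]:
    """Flatten named parameters and cut them into non-overlapping tokens.

    Hypothetical scheme: concatenate all values in dictionary order,
    zero-pad to a multiple of token_size, then slice into equal chunks.
    """
    layout = [(name, len(values)) for name, values in params.items()]
    flat = [v for values in params.values() for v in values]
    pad = (-len(flat)) % token_size  # zeros needed to fill the last token
    flat = flat + [0.0] * pad
    tokens = [flat[i:i + token_size] for i in range(0, len(flat), token_size)]
    return tokens, layout


def detokenize(tokens: List[List[float]], layout: Layout) -> Dict[str, List[float]]:
    """Invert tokenize: join the tokens and drop the padding via the layout."""
    flat = [v for token in tokens for v in token]
    params: Dict[str, List[float]] = {}
    offset = 0
    for name, count in layout:
        params[name] = flat[offset:offset + count]
        offset += count
    return params


# Toy "network" with 10 scalar parameters (illustrative values only).
params = {
    "conv.weight": [0.1, -0.2, 0.3, 0.4, -0.5],
    "conv.bias": [0.0, 0.6],
    "fc.weight": [0.7, -0.8, 0.9],
}
tokens, layout = tokenize(params, token_size=4)
restored = detokenize(tokens, layout)
```

Because the tokens do not overlap and the layout records each parameter's length, the round trip is lossless: `restored` equals `params`, and every token covers a distinct slice of the model. In a pipeline like the one the abstract describes, each token would then be generated by the diffusion process under its own condition, so memory would depend on the token size rather than on the full parameter count.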

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 3

Strengths

The paper is clearly written, the model straightforward and the experimental results quite convincing. The authors performed a series of insightful ablations (e.g., different sequence models, with/without sequence model, and different positional embeddings).

Weaknesses

My understanding is that the SANE method (Schürholt et al., 2022), which is mentioned in the related works section, should have similar scaling properties as RPG. It seems to me that OOM errors can be avoided by changing their tokenization scheme and/or reducing the window size of the sequential autoencoder, so I wonder if the OOM errors in table 8 are a bit misleading. What hyperparameters were chosen exactly? The main differences I see between RPG and SANE in the use of Mamba vs. a sequential

Reviewer 02 · Rating 5 · Confidence 2

Strengths

* The paper introduces a technique for parameter generation that appears to scale to larger models than some previous works.
* The evaluation considers models and model architectures for different vision tasks and a language task.

Weaknesses

W1. While figure-2 provides a reasonable motivation, it is not clear from the main paper exactly what are the trade-offs that one should consider for parameter generation. E.g. Why should one consider parameter generation as opposed to training/tuning a model on a given dataset? Is there any advantage in terms of compute costs, if so which stages of the proposed RPG method contribute to it?
W2. Some related works necessary to understand the main contributions of the paper are in the appendix.

Reviewer 03 · Rating 3 · Confidence 3

Strengths

- The paper introduces a new approach that combines recurrent neural mechanisms with diffusion-based generative modeling to effectively capture and stabilize dependencies between model parameters. This approach specifically targets complex parameter interdependencies that arise in large models, and experimental results suggest it leads to stable and consistent parameter generation, enhancing model robustness.
- The method is validated across diverse architectures, including but not limited to Re

Weaknesses

- **Practical Limitations**: The method exhibits significant practical constraints for both similar and generalizable tasks. The reliance on numerous checkpoints (50 checkpoints) from fully trained models raises questions about its application to novel architectures. Although preliminary exploration of task transfer is presented, the necessity of training many models for seen tasks and the requirement for clearly defined task embeddings to relate seen and unseen tasks limits practical applicabil

Reviewer 04 · Rating 5 · Confidence 3

Strengths

- The paper presents a new approach for parameter generation combining autoregression and diffusion. They use SSMs (Mamba) to easily and effectively perform large-scale parameter generation.
- The paper also presents a method for parameter tokenization, which they later show performs significantly better than the tokenization methods used by previous works.
- The paper includes interesting ablation studies that show the contributions of the different components presented.

Weaknesses

- The approach is very limited in novelty. Autoregressive models feeding embeddings into a diffusion model is not new in general.
- The paper lacks a thorough analysis of whether the method genuinely learns to generalize the parameter distribution or simply memorizes the training data. There is a need for more evidence to show that the generated parameters are not merely reproducing the training checkpoints.
- The evaluation on unseen tasks using CIFAR-10 with binary embeddings is not clearly ex

Reviewer 05 · Rating 8 · Confidence 4

Strengths

- Innovatively proposes using an RNN to model the relationships between different parts and then generating parameters for each part separately, solving the out-of-memory (OOM) problem. The method is simple and easy to understand. Its effectiveness is thoroughly validated through detailed ablation studies and analytical experiments.
- Cleverly designs unseen tasks to test the characteristics of the generated model parameters, emphasizing the importance of parameter generation.
- Provides detaile

Weaknesses

- The introduction of inference details is unclear. Does repeating the experiment ten times involve changing the permutation state used, or just altering some random state?
- The author mentions in the limitations that the method is still limited to generating parameters for models with the same architecture and task.

Code & Models

Repositories

nus-hpc-ai-lab/recurrent-parameter-generation
PyTorch · Official

Models

🤗
MTDoven/Recurrent-Parameter-Generation
model · ♡ 5

Datasets

MTDoven/ViTTiny1022
dataset · 793 downloads

Taxonomy

Topics: Advanced Measurement and Metrology Techniques · Heat Transfer and Optimization · Topology Optimization in Engineering

Methods: Diffusion