One-for-All Model Initialization with Frequency-Domain Knowledge

Jianlu Shen; Fu Feng; Yucheng Xie; Jiaqi Lv; Xin Geng

arXiv:2603.07523·cs.LG·March 10, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces FRONT, a frequency-domain method that extracts and transfers a model's foundational knowledge from low-frequency weight components, enabling flexible, training-free initialization of models across various scales with improved efficiency.

Contribution

The paper reveals that a model's core knowledge is encoded in low-frequency weights and proposes FRONT, a DCT-based framework for efficient, scalable, and training-free model initialization.
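As a rough illustration of the idea above, the following sketch takes the 2D DCT of a single weight matrix, keeps only its low-frequency (top-left) block, zero-pads or truncates that block to a target shape, and inverts the transform to get an initialization. It is not the authors' implementation. The function names (`dct2`, `resize_spectrum`, `init_from_low_freq`), the per-matrix 2D treatment, and the `keep` parameter are assumptions made for illustration. According to the reviews, the paper stacks weights across layers and applies a 3D DCT, and it is not stated here whether coefficients are rescaled when the size changes, so no rescaling is applied.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis; row k is the k-th cosine basis vector of length n."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def dct2(w):
    """2D DCT-II of w, computed as C_rows @ w @ C_cols^T."""
    c_r, c_c = dct_matrix(len(w)), dct_matrix(len(w[0]))
    return matmul(matmul(c_r, w), transpose(c_c))

def idct2(f):
    """Inverse of dct2; the basis is orthonormal, so the inverse is the transpose."""
    c_r, c_c = dct_matrix(len(f)), dct_matrix(len(f[0]))
    return matmul(matmul(transpose(c_r), f), c_c)

def resize_spectrum(f, rows, cols):
    """Copy the top-left (low-frequency) block of f into a rows x cols grid,
    truncating where f is larger and zero-padding where it is smaller."""
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(min(rows, len(f))):
        for j in range(min(cols, len(f[0]))):
            out[i][j] = f[i][j]
    return out

def init_from_low_freq(w_src, keep, target_shape):
    """Hypothetical helper: low-frequency block of w_src -> weights of target_shape.
    Assumption: no amplitude rescaling is applied when the size changes."""
    spectrum = dct2(w_src)
    low_freq = resize_spectrum(spectrum, keep[0], keep[1])   # keep low frequencies
    padded = resize_spectrum(low_freq, target_shape[0], target_shape[1])
    return idct2(padded)

# Usage: initialize a 6x6 matrix from the 2x2 low-frequency block of a 4x4 source.
w_src = [[1.0] * 4 for _ in range(4)]
w_new = init_from_low_freq(w_src, keep=(2, 2), target_shape=(6, 6))
print(len(w_new), len(w_new[0]))  # 6 6
```

Because the transform here is orthonormal, changing the matrix size changes the magnitude of the reconstructed weights (a constant 4x4 source of ones comes back as a constant 6x6 matrix of 4/6), so a practical variant would likely need some rescaling choice.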

Findings

01

FRONT achieves state-of-the-art transfer performance.

02

Accelerates convergence by up to 15 times in vision tasks.

03

Reduces training FLOPs by 40.5% in language tasks.

Abstract

Transferring knowledge by fine-tuning large-scale pre-trained networks has become a standard paradigm for downstream tasks, yet the knowledge of a pre-trained model is tightly coupled with monolithic architecture, which restricts flexible reuse across models of varying scales. In response to this challenge, recent approaches typically resort to either parameter selection, which fails to capture the interdependent structure of this knowledge, or parameter prediction using generative models that depend on impractical access to large network collections. In this paper, we empirically demonstrate that a model's foundational, task-agnostic knowledge, its "learngene", is encoded within the low-frequency components of its weights, and can be efficiently inherited by downstream models. Based on this insight, we propose FRONT (FRequency dOmain kNowledge Transfer), a novel framework that uses the…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The motivation of the paper is clear, and the writing is generally well-structured.
2. The paper provides evidence that task-agnostic knowledge resides in a model's low-frequency components, an intuitively plausible and insightful finding. It also instantiates the learngene concept as low-frequency representations that can be readily extracted from the model.
3. The experiments are generally thorough and demonstrate the effectiveness of the proposed method.

Weaknesses

Please refer to the Questions section below.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The proposed method extracts a low-frequency learngene and uses padding or truncation to initialize a variety of models across ViTs and CNNs. It generalizes well across different depths and widths, with minimal computation needed.
- The proposed method speeds up convergence and cuts compute compared with training from scratch or learned-transform baselines.

Weaknesses

- The motivation behind the design is unclear. Why stack weights across layers and then apply a 3D DCT? What if the DCT were applied to 2D weights, followed by some selection process to obtain the learngene?
- The presentation of the experimental results is unclear, and the experimental settings are questionable. For instance, in Table 1 it is unclear which base model is used for initialization in each block. Reporting results as 10-epoch accuracy is also not ideal.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The concrete instantiation of learngene as low-frequency components is intuitive and creative, with convincing evidence in Figure 1 demonstrating stability of low-frequency components across models and tasks.
2. FRONT's zero-cost extraction and flexible padding/truncation mechanism make it substantially more practical than training-based methods like GHN-3 and WAVE.
3. The evaluation spans ViT/ResNet/MLP/CNN architectures, multiple datasets, both vision and language domains, and systematic

Weaknesses

1. The frequency ratio $r$ varies by model size (2.2M/3.2M/13.0M for Ti/S/B in Table 1) without principled justification, suggesting $r$ is model-size dependent. This systematic issue is not explored, and hyperparameters like the decay rates $\gamma_d$ in Eq. 6 lack principled selection guidelines.
2. When comparing with training-based methods (WAVE/TLEG), FRONT+ also requires 150 epochs of training, so these should be evaluated separately from FRONT's direct extraction.
3. In Table 3, FRONT occasionally

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications