FLOPS: Forward Learning with OPtimal Sampling
Tao Ren, Zishi Zhang, Jinyang Jiang, Guanghao Li, Zeliang Zhang, Mingqian Feng, Yijie Peng

TL;DR
This paper introduces an optimal query allocation method for forward learning that reduces gradient estimation variance, improving scalability and efficiency in training vision models and other applications.
Contribution
It proposes a novel, theoretically verified query allocator that adaptively assigns queries per data point to balance accuracy and computational cost in forward learning.
Findings
Significantly improves training efficiency for Vision Transformers.
Enhances scalability of forward-learning algorithms in practical applications.
Demonstrates effectiveness in black-box tasks like prompt tuning and multimodal alignment.
Abstract
Given the limitations of backpropagation, perturbation-based gradient computation methods have recently gained focus for learning with only forward passes, also referred to as queries. Conventional forward learning consumes enormous queries on each data point for accurate gradient estimation through Monte Carlo sampling, which hinders the scalability of those algorithms. However, not all data points deserve equal queries for gradient estimation. In this paper, we study the problem of improving the forward learning efficiency from a novel perspective: how to reduce the gradient estimation variance with minimum cost? For this, we propose to allocate the optimal number of queries over each data in one batch during training to achieve a good balance between estimation accuracy and computational efficiency. Specifically, with a simplified proxy objective and a reparameterization technique,…
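To make the setting in the abstract concrete, the following is a minimal sketch of perturbation-based gradient estimation with a different number of queries per data point. It assumes a plain Gaussian-perturbation estimator with a baseline forward pass and a simple proportional allocation rule. It is not the allocator proposed in the paper, whose details are not given here. All names (`zo_gradient`, `allocate_queries`, `batch_gradient`, `sigma`, `scores`) are hypothetical and chosen only for illustration.

```python
import random


def zo_gradient(loss, theta, num_queries, sigma=1e-2, rng=None):
    """Monte Carlo gradient estimate of `loss` at `theta` using forward passes only.

    Each query perturbs theta with Gaussian noise and evaluates the loss once.
    """
    rng = rng or random.Random(0)
    base = loss(theta)  # one extra forward pass, used as a baseline
    grad = [0.0] * len(theta)
    for _ in range(num_queries):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        perturbed = [t + sigma * e for t, e in zip(theta, eps)]
        scale = (loss(perturbed) - base) / sigma
        for j, e in enumerate(eps):
            grad[j] += scale * e / num_queries
    return grad


def allocate_queries(scores, budget):
    """Split `budget` queries across data points in proportion to `scores`.

    Assumption: `scores` are non-negative per-sample importance values
    (hypothetical; for example, a rough variance estimate). Every data
    point receives at least one query.
    """
    n = len(scores)
    if budget < n:
        raise ValueError("budget must allow at least one query per data point")
    spare = budget - n
    total = sum(scores)
    if total <= 0:
        shares = [spare / n] * n
    else:
        shares = [spare * s / total for s in scores]
    counts = [1 + int(s) for s in shares]
    # Hand out queries lost to rounding down, largest fractional part first.
    leftover = budget - sum(counts)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def batch_gradient(per_sample_losses, theta, counts, sigma=1e-2, seed=0):
    """Average per-sample estimates, using counts[i] queries for sample i."""
    rng = random.Random(seed)
    grads = [
        zo_gradient(loss, theta, k, sigma, rng)
        for loss, k in zip(per_sample_losses, counts)
    ]
    n = len(grads)
    return [sum(g[j] for g in grads) / n for j in range(len(theta))]


if __name__ == "__main__":
    # Two toy per-sample losses; the second is given a higher score.
    losses = [
        lambda th: (th[0] - 1.0) ** 2,
        lambda th: 10.0 * (th[1] + 2.0) ** 2,
    ]
    counts = allocate_queries([1.0, 3.0], budget=40)
    print(counts, batch_gradient(losses, [0.0, 0.0], counts))
```

With a fixed total budget, the allocation step only decides how the queries are distributed. The estimator itself is unchanged, so a uniform allocation is recovered by passing equal scores.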
Peer Reviews
Decision: ICLR 2025 Poster
1. The empirical accuracy results for ViT and CLIP appear to surpass those of the baselines. The authors have also conducted essential ablation studies to further validate their findings. 2. While the text in Figures 2 and 3 is smaller than the standard text size, making it challenging to read, the color combinations used in these figures are visually appealing.
1. Sections 3 and 4 introduce several undefined notations, leading to ambiguity in the exposition. For instance, Line 148 mentions a term "a" whose role and relation to Equation (1) are unclear: is it a distribution, a hidden representation, or something else? Furthermore, "G(·)" on Line 160 and "y_j" on Line 164 are undefined, with no clarification of the indexing or distribution from which j is sampled. Additionally, Equation (9)'s term "K" lacks a defined scope. The abbreviation "LR" is also
1. The idea of dynamically allocating different numbers of queries to each data point within a batch during training is novel, and it is a point that previous zeroth-order optimization (forward learning) methods have not considered. 2. The proposed method is intuitive. The approach of leveraging a Gaussian Allocator (GA) combined with a likelihood ratio method introduces a creative solution to minimize gradient estimation variance. Through appropriate approximations, the computational cos
1. Although the authors provide part of the source code, I believe the coding is not advisable. Specifically, the authors override nn.Linear to create a custom Linear class and similarly override nn.Conv2d to create a custom Conv2d class. This approach results in the proposed method being tied to a specific model architecture, making it difficult to adapt to other architectures. In fact, existing zeroth-order optimization methods, such as ZO-SGD [1], ZO-AdaMM [2], and DeepZero [3], all have core
1) The motivation is clear: optimizing the allocation of queries to effectively reduce computational overhead. 2) The experimental results show strong performance relative to the baselines. 3) The study provides both experimental and theoretical results, offering a well-rounded evaluation.
1) I am curious why other methods that use all queries would perform worse than this method, which uses a limited number of queries for each data point. 2) The exact computational cost of using all queries equally for each data point, compared with your allocation method, is not reported, even though it is one of the main motivations. Minor: 3) In the abstract, the phrase “propose to allocate the optimal number of queries over each data” isn’t entirely accurate, as a total query budget must be pr
Taxonomy
Topics: Energy Efficient Wireless Sensor Networks
Methods: Focus
