Why Zeroth-Order Adaptation May Forget Less: A Randomized Shaping Theory

Yao Shu; Jian Mu; Zhongxiang Dai

arXiv:2605.10658·cs.LG·May 12, 2026

TL;DR

This paper introduces a randomized shaping theory for zeroth-order (ZO) adaptation in continual learning. It explains why ZO methods may forget less than first-order (FO) methods by analyzing how the adaptation shape exposes retention curvature.

Contribution

The paper provides a local randomized gradient-shaping analysis that clarifies the retention benefits of ZO adaptation, and it proposes the RISE algorithm for an improved stability-plasticity tradeoff.

Findings

01

ZO reduces mean forgetting relative to FO when the FO direction has above-average retention curvature.

02

The analysis separates mean-step damage from random exposure, highlighting the role of curvature and blockwise sampling.

03

RISE applies a calibrated ZO shape to exact FO gradients, enhancing stability in continual learning.
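
The condition in finding 01 can be illustrated with a short worked calculation. This is a sketch under assumptions that the summary above does not state: Gaussian perturbations $u \sim \mathcal{N}(0, I_d)$, a local quadratic retention loss with symmetric curvature $H$, an incoming FO gradient $g$, and a ZO shape $uu^\top$ rescaled so that its expected squared step norm matches FO. The symbols $u$, $H$, $g$, and $d$ are introduced here for illustration, and the paper's own setting and constants may differ.

```latex
\begin{align*}
\mathbb{E}\,[uu^\top g] &= g,
\qquad
\mathbb{E}\,\|uu^\top g\|^2 = (d+2)\,\|g\|^2, \\
\mathbb{E}\,[\,g^\top uu^\top H\, uu^\top g\,]
  &= 2\,g^\top H g + \operatorname{tr}(H)\,\|g\|^2, \\
\underbrace{g^\top H g}_{\text{FO}}
 \;-\;
\underbrace{\frac{2\,g^\top H g + \operatorname{tr}(H)\,\|g\|^2}{d+2}}_{\text{norm-matched ZO}}
  &= \frac{d}{d+2}\left( g^\top H g - \frac{\operatorname{tr}(H)}{d}\,\|g\|^2 \right).
\end{align*}
```

Under these assumptions the gap is positive exactly when the curvature along $g$, namely $g^\top H g / \|g\|^2$, exceeds the average curvature $\operatorname{tr}(H)/d$, which matches the "above-average retention curvature" condition.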

Abstract

Continual learning requires new-task adaptation without damaging previously acquired capabilities. Recent forward-pass and zeroth-order (ZO) results show that low-query adaptation may retain better than first-order (FO) descent, but the usual view of ZO as noisy FO estimation does not explain why. We give a local randomized gradient-shaping analysis: finite differences expose a raw shape that is mean-aligned with FO, while the norm-matched comparator fixes the expected squared adaptation norm. Under this controlled comparison, forgetting depends on how the adaptation shape exposes retention curvature. For norm-matched ZO, the expected shaped retention curvature obeys an exact identity that preserves the isotropic retention floor while contracting only the anisotropic component. Projecting this identity onto the incoming gradient yields the observable FO-ZO quadratic forgetting gap: ZO…
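
The "isotropic floor preserved, anisotropic component contracted" statement can be checked numerically in a simple special case. The sketch below is not the paper's code: it assumes Gaussian directions `u ~ N(0, I_d)`, a shape matrix `u u^T` rescaled by `1/(d+2)` to match the FO squared step norm, and a hypothetical symmetric retention curvature `H`. All function names are illustrative, and the contraction factor `2/(d+2)` is a property of this Gaussian setup, not a value taken from the abstract.

```python
import numpy as np


def split_isotropic(H):
    """Split H into its isotropic floor tr(H)/d and the trace-free remainder."""
    d = H.shape[0]
    floor = np.trace(H) / d
    return floor, H - floor * np.eye(d)


def shaped_curvature_closed_form(H):
    """Expected norm-matched shaped curvature E[P H P] / (d + 2), P = u u^T.

    Assumption: u ~ N(0, I_d), so E[u u^T H u u^T] = 2 H + tr(H) I.
    """
    d = H.shape[0]
    return (2.0 * H + np.trace(H) * np.eye(d)) / (d + 2)


def shaped_curvature_monte_carlo(H, n_samples=200_000, seed=0):
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = np.random.default_rng(seed)
    d = H.shape[0]
    U = rng.standard_normal((n_samples, d))
    q = np.einsum("ni,ij,nj->n", U, H, U)      # u^T H u for each sample
    M = (U * q[:, None]).T @ U / n_samples     # mean of (u^T H u) u u^T
    return M / (d + 2)


def forgetting_gap(H, g):
    """FO minus norm-matched ZO quadratic forgetting along gradient g."""
    return g @ H @ g - g @ shaped_curvature_closed_form(H) @ g


if __name__ == "__main__":
    H = np.diag([4.0, 1.0, 1.0])               # hypothetical retention curvature
    d = H.shape[0]
    floor, H_aniso = split_isotropic(H)
    S = shaped_curvature_closed_form(H)

    # The shaped curvature keeps the floor and shrinks only the trace-free part.
    print(np.allclose(S, floor * np.eye(d) + (2.0 / (d + 2)) * H_aniso))

    g_high = np.array([1.0, 0.0, 0.0])         # above-average curvature direction
    g_low = np.array([0.0, 1.0, 0.0])          # below-average curvature direction
    print(forgetting_gap(H, g_high) > 0, forgetting_gap(H, g_low) < 0)
```

In this toy setting the gap changes sign with the direction of `g`: ZO exposes less retention curvature than FO only when `g` points along above-average curvature, so the comparison depends on where the incoming gradient lies and is not a uniform advantage for ZO.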
