Hidden Failure Modes of Gradient Modification under Adam in Continual Learning, and Adaptive Decoupled Moment Routing as a Repair
Yuelin Hu, Zhenbo Yu, Zhengxue Cheng, Wei Liu, and Li Song

TL;DR
This paper shows that modifying gradients upstream of Adam has a hidden failure mode in continual learning, and proposes adaptive decoupled routing as a repair that prevents the collapse.
Contribution
It identifies a failure mode that arises when gradient projection is applied upstream of Adam, and introduces an adaptive decoupled routing technique that stabilizes continual learning across multiple methods.
Findings
Shared-routing projection baselines collapse close to vanilla forgetting (12.5–12.8 vs. 13.2 on the 8-domain stream).
Adaptive decoupled routing remains stable at 9.4, improving over vanilla by 3.8 units.
The failure is linked to an inflation that arises in Adam's second-moment pathway.
Abstract
Many continual-learning methods modify gradients upstream (e.g., projection, penalty rescaling, replay mixing) while treating Adam as a neutral backend. We show this composition has a hidden failure mode. In a high-overlap, non-adaptive 8-domain continual LM, all shared-routing projection baselines collapse close to vanilla forgetting (12.5–12.8 vs. 13.2). A 0.5% replay buffer is the strongest shared alternative but still reaches 11.6, while fixed-strength decoupling is worse than vanilla at 14.1. Only adaptive decoupled routing remains stable at 9.4, improving over vanilla by 3.8 units. On a 16-domain stream, its gain over the strongest shared-routing projection baseline grows to 4.5–4.8 units. The failure is largely invisible on clean benchmarks. We explain this effect through Adam's second-moment pathway: in the tested regime, projection induces a 1/(1-alpha) inflation of the…
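One way a 1/(1-alpha) factor can arise from Adam's second-moment pathway is sketched below in a toy scalar setting. This is an illustration under stated assumptions, not the paper's implementation: it assumes that alpha is the fraction of a gradient coordinate removed upstream, that "shared routing" means both Adam moments see the modified gradient, and that "decoupled routing" means the second moment sees the raw gradient instead. The helper `adam_steps` and all variable names are hypothetical. The sketch relies only on a known property of Adam: with a negligible epsilon, its update is invariant to a constant rescaling of the gradient.

```python
import math

def adam_steps(grads_m, grads_v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-12):
    """Scalar Adam with bias correction.

    grads_m feeds the first moment, grads_v feeds the second moment.
    Passing the same sequence for both gives standard Adam.
    (Hypothetical helper for illustration only.)
    """
    m = v = 0.0
    updates = []
    for t, (gm, gv) in enumerate(zip(grads_m, grads_v), start=1):
        m = b1 * m + (1 - b1) * gm
        v = b2 * v + (1 - b2) * gv * gv
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

alpha = 0.75                                  # assumed attenuation strength
raw = [1.0, 0.8, 1.2, 0.9, 1.1] * 20          # synthetic raw gradients
scaled = [(1 - alpha) * g for g in raw]       # gradient after upstream attenuation

baseline = adam_steps(raw, raw)               # unmodified gradient, standard Adam
shared = adam_steps(scaled, scaled)           # both moments see the modified gradient
decoupled = adam_steps(scaled, raw)           # second moment sees the raw gradient (assumption)

shared_ratio = shared[-1] / baseline[-1]      # about 1: the attenuation is normalized away
decoupled_ratio = decoupled[-1] / baseline[-1]  # about 1 - alpha: the attenuation survives
inflation = shared[-1] / decoupled[-1]        # about 1 / (1 - alpha)

print(shared_ratio, decoupled_ratio, inflation)
```

In this toy setting the shared-routing step is about as large as the unmodified one, so the intended attenuation is cancelled by the second-moment normalization, and the step is about 1/(1-alpha) times larger than when the second moment is computed from the raw gradient. A real projection attenuates only some directions and Adam normalizes per coordinate, so the scalar case is only a simplification of the tested regime.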
