Deterministic Continuous Replacement: Fast and Stable Module Replacement in Pretrained Transformers
Rowan Bradbury, Aniket Srinivasan Ashok, Sai Ram Kasanagottu, Gunmay Jhingran, Shuai Meng

TL;DR
This paper introduces Deterministic Continuous Replacement (DCR), a method for stable and efficient module replacement in pretrained transformers, addressing stability issues during the replacement process.
Contribution
The paper presents DCR, a deterministic and theoretically grounded approach that improves stability and convergence in replacing modules within pretrained models.
Findings
DCR achieves faster convergence than stochastic methods.
DCR provides stronger alignment in attention replacement tasks.
DCR eliminates gate-induced gradient variance.
Abstract
Replacing modules in pretrained models, especially swapping quadratic self-attention for efficient attention alternatives, poses a hard optimization problem: cold-start reinitialization destabilizes frozen backbones. We isolate this core stability challenge in a controlled study. Deterministic Continuous Replacement (DCR) blends teacher and student outputs with a deterministic, annealed weight. Theoretically, DCR eliminates gate-induced gradient variance inherent to stochastic replacement. In a single-seed study, DCR attains faster convergence and stronger alignment than stochastic gating and distillation baselines on controlled attention replacement, establishing a foundation for heterogeneous operator swaps.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsStochastic Gradient Optimization Techniques · Machine Learning in Materials Science · Neural Networks and Reservoir Computing
