Deterministic Continuous Replacement: Fast and Stable Module Replacement in Pretrained Transformers

Rowan Bradbury; Aniket Srinivasan Ashok; Sai Ram Kasanagottu; Gunmay Jhingran; Shuai Meng

arXiv:2511.18670 · cs.LG · November 25, 2025

TL;DR

This paper introduces Deterministic Continuous Replacement (DCR), a method for replacing modules in pretrained transformers. DCR blends teacher and student outputs with a deterministic, annealed weight, targeting the instability that cold-start reinitialization causes in a frozen backbone.

Contribution

The paper presents DCR, a deterministic, theoretically grounded approach that improves stability and convergence when replacing modules within pretrained models.

Findings

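01

In a single-seed study, DCR converges faster than stochastic gating and distillation baselines.

02

DCR attains stronger alignment than those baselines on controlled attention replacement.

03

In theory, DCR eliminates the gate-induced gradient variance inherent to stochastic replacement.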
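
The variance claim can be illustrated with a short worked equation. This is only a sketch under stated assumptions: the stochastic baseline is modeled as a Bernoulli gate $g$ with mean $\alpha$, and $T(x)$, $S(x)$, $\alpha$, and $g$ are notation introduced here, not taken from the paper. It covers the forward output only; the paper's statement concerns gradients.

```latex
% Assumed notation: T(x) = teacher output, S(x) = student output,
% \alpha \in [0,1] = blend weight, g \sim \mathrm{Bernoulli}(\alpha) = stochastic gate.
\begin{align*}
y_{\mathrm{stoch}} &= g\,S(x) + (1-g)\,T(x) = T(x) + g\,\bigl(S(x) - T(x)\bigr), \\
\mathbb{E}_g\!\left[y_{\mathrm{stoch}}\right] &= \alpha\,S(x) + (1-\alpha)\,T(x) = y_{\mathrm{det}}, \\
\mathrm{Var}_g\!\left[y_{\mathrm{stoch},i}\right] &= \alpha(1-\alpha)\,\bigl(S_i(x) - T_i(x)\bigr)^2, \\
\mathrm{Var}_g\!\left[y_{\mathrm{det},i}\right] &= 0 .
\end{align*}
```

Under this model the deterministic blend equals the expectation of the gated output, and the gate contributes no randomness of its own. The gate's variance term is largest at $\alpha = 1/2$ and wherever the student and teacher outputs differ most.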

Abstract

Replacing modules in pretrained models, especially swapping quadratic self-attention for efficient attention alternatives, poses a hard optimization problem: cold-start reinitialization destabilizes frozen backbones. We isolate this core stability challenge in a controlled study. Deterministic Continuous Replacement (DCR) blends teacher and student outputs with a deterministic, annealed weight. Theoretically, DCR eliminates gate-induced gradient variance inherent to stochastic replacement. In a single-seed study, DCR attains faster convergence and stronger alignment than stochastic gating and distillation baselines on controlled attention replacement, establishing a foundation for heterogeneous operator swaps.
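
The blending step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the linear schedule, the direction of the anneal (weight moving from teacher to student), and the names `alpha_schedule` and `dcr_blend` are all assumptions made for the example.

```python
def alpha_schedule(step, total_steps):
    """Hypothetical linear anneal of the student weight from 0 to 1.

    The abstract does not specify the schedule; linear is an assumption.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so steps outside [0, total_steps] stay in [0, 1].
    return min(max(step / total_steps, 0.0), 1.0)


def dcr_blend(teacher_out, student_out, alpha):
    """Deterministic convex blend of two equal-length output vectors.

    alpha = 0 returns the teacher output, alpha = 1 the student output.
    No random gate is sampled, so the result is fully determined by alpha.
    """
    if len(teacher_out) != len(student_out):
        raise ValueError("outputs must have the same length")
    return [(1.0 - alpha) * t + alpha * s
            for t, s in zip(teacher_out, student_out)]


if __name__ == "__main__":
    # Toy outputs standing in for one module's teacher and student activations.
    teacher_out = [1.0, 2.0, 3.0]
    student_out = [0.0, 0.0, 0.0]
    for step in (0, 50, 100):
        alpha = alpha_schedule(step, total_steps=100)
        print(step, alpha, dcr_blend(teacher_out, student_out, alpha))
```

In a real training loop the two vectors would be the outputs of the original module and its replacement on the same input, with only the replacement receiving gradient updates. That detail is likewise an assumption here.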

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning in Materials Science · Neural Networks and Reservoir Computing