Training Memory in Deep Neural Networks: Mechanisms, Evidence, and Measurement Gaps
Vasileios Sevetlidis, George Pavlidis

TL;DR
This paper surveys the mechanisms behind memory effects in deep neural network training, introduces seed-paired causal estimands and perturbation primitives, and proposes a measurement protocol for attributing how much training history matters.
Contribution
It organizes training memory mechanisms by source, lifetime, and visibility; introduces seed-paired, function-space causal estimands and portable perturbation primitives; and proposes a protocol and reporting checklist for measuring training history effects.
Findings
Training memory effects depend on optimizer state, data-order policy, the nonconvex optimization path, and auxiliary state.
The survey introduces seed-paired causal estimands and perturbation primitives for analyzing these effects.
It proposes a reporting checklist and a protocol for measuring the influence of training history.
Abstract
Modern deep-learning training is not memoryless. Updates depend on optimizer moments and averaging, data-order policies (random reshuffling vs with-replacement, staged augmentations and replay), the nonconvex path, and auxiliary state (teacher EMA/SWA, contrastive queues, BatchNorm statistics). This survey organizes mechanisms by source, lifetime, and visibility. It introduces seed-paired, function-space causal estimands; portable perturbation primitives (carry/reset of momentum/Adam/EMA/BN, order-window swaps, queue/teacher tweaks); and a reporting checklist with audit artifacts (order hashes, buffer/BN checksums, RNG contracts). The conclusion is a protocol for portable, causal, uncertainty-aware measurement that attributes how much training history matters across models, data, and regimes.
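To make the carry/reset primitive and the seed-paired estimand concrete, the following is a minimal sketch on a toy linear-regression problem, not the authors' protocol. Every name in it (`make_data`, `order_hash`, `train`, `paired_effect`), the heavy-ball update, and the hyperparameters are illustrative assumptions. For each seed it trains two arms that share the same data order: one carries the momentum buffer throughout, the other zeroes it at the start of a chosen epoch. The effect is measured in function space as the mean squared difference in predictions on a fixed probe set, and an order hash is checked so that the only difference between arms is the perturbation.

```python
import hashlib
import numpy as np

def make_data(seed=0, n=64, d=5):
    """Toy regression data (an assumption for illustration only)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y

def order_hash(orders):
    """Audit artifact: SHA-256 over the per-epoch index permutations."""
    h = hashlib.sha256()
    for perm in orders:
        h.update(np.asarray(perm, dtype=np.int64).tobytes())
    return h.hexdigest()

def train(X, y, seed, reset_at=None, epochs=4, batch=8, lr=0.05, beta=0.9):
    """SGD with heavy-ball momentum and random reshuffling.

    If reset_at is an epoch index, the momentum buffer is zeroed at the
    start of that epoch (the "reset" arm); otherwise it is carried.
    """
    rng = np.random.default_rng(seed)  # the RNG drives data order only
    n, d = X.shape
    w, v = np.zeros(d), np.zeros(d)
    orders = []
    for epoch in range(epochs):
        perm = rng.permutation(n)
        orders.append(perm)
        if reset_at is not None and epoch == reset_at:
            v = np.zeros(d)  # perturbation primitive: reset momentum
        for start in range(0, n, batch):
            idx = perm[start:start + batch]
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            v = beta * v + grad
            w = w - lr * v
    return w, order_hash(orders)

def paired_effect(seeds, reset_at=2):
    """Seed-paired, function-space effect of resetting momentum."""
    X, y = make_data()
    X_probe, _ = make_data(seed=123, n=32)  # fixed probe inputs
    effects = []
    for s in seeds:
        w_carry, h_carry = train(X, y, s)
        w_reset, h_reset = train(X, y, s, reset_at=reset_at)
        assert h_carry == h_reset  # both arms saw the same data order
        diff = X_probe @ w_carry - X_probe @ w_reset
        effects.append(np.mean(diff ** 2))
    effects = np.array(effects)
    # Mean effect and its standard error across seeds.
    return effects.mean(), effects.std(ddof=1) / np.sqrt(len(effects))

if __name__ == "__main__":
    mean_eff, se = paired_effect(seeds=range(8))
    print(f"momentum-reset effect: {mean_eff:.3e} +/- {se:.3e}")
```

Pairing on the seed holds initialization and data order fixed across arms, so the between-arm difference can be attributed to the momentum state alone; the spread across seeds gives a rough uncertainty estimate.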
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
