Not All Forgetting Is Equal: Architecture-Dependent Retention Dynamics in Fine-Tuned Image Classifiers
Miit Daga, Swarna Priya Ramu

TL;DR
This study investigates how different neural network architectures forget training samples during fine-tuning, revealing architecture-dependent, stochastic, and class-related patterns with implications for ensemble construction and data management.
Contribution
It provides a detailed analysis of sample forgetting dynamics across architectures, highlighting the non-intrinsic nature of sample difficulty and the limitations of static curriculum methods.
Findings
ResNet-18 and DeiT-Small forget different samples with low overlap.
ViT exhibits more structured forgetting than CNNs.
Sample forgetting is highly stochastic across different training runs.
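The overlap comparison in the first finding can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes a boolean correctness matrix of shape (epochs, samples) per model, and it assumes a "forgetting event" means a correct-to-incorrect transition between consecutive epochs, a definition the summary does not state. The function names are hypothetical.

```python
import numpy as np

def forgetting_counts(correct):
    """Count correct -> incorrect transitions per sample.

    `correct` is an (epochs, samples) boolean array.
    Assumption: a forgetting event is a sample that was classified
    correctly at epoch t and incorrectly at epoch t + 1.
    """
    correct = np.asarray(correct, dtype=bool)
    forgot = correct[:-1] & ~correct[1:]
    return forgot.sum(axis=0)

def top_fraction(counts, fraction=0.10):
    """Return the indices of the most-forgotten `fraction` of samples as a set."""
    k = max(1, int(round(fraction * len(counts))))
    # Stable sort so ties are broken by sample index, which keeps runs comparable.
    order = np.argsort(-np.asarray(counts), kind="stable")
    return set(order[:k].tolist())

def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

Calling `jaccard(top_fraction(counts_a), top_fraction(counts_b))` on the counts from two models, or from two seeds of the same model, gives a single overlap score between 0 and 1; low values mean the two runs forget mostly different samples.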
Abstract
Fine-tuning pretrained image classifiers is standard practice, yet which individual samples are forgotten during this process, and whether forgetting patterns are stable or architecture-dependent, remains unclear. Understanding these dynamics has direct implications for curriculum design, data pruning, and ensemble construction. We track per-sample correctness at every epoch during fine-tuning of ResNet-18 and DeiT-Small on a retinal OCT dataset (7 classes, 56:1 imbalance) and CUB-200-2011 (200 bird species), fitting Ebbinghaus-style exponential decay curves to each sample's retention trace. Five findings emerge. First, the two architectures forget fundamentally different samples: Jaccard overlap of the top 10 percent most-forgotten samples is 0.34 on OCTDL and 0.15 on CUB-200. Second, ViT forgetting is more structured (mean ) than CNN forgetting (). Third, per-sample…
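As a rough illustration of fitting an exponential decay curve to one sample's retention trace, the sketch below fits a one-parameter model R(t) = exp(-lambda * t) by least squares over a grid of candidate rates. The exact functional form, the fitting procedure, and the way the trace is constructed are not given in the truncated abstract, so all three are assumptions here, as are the function name and the grid bounds. The returned R^2 is a conventional goodness-of-fit value and is not claimed to be the paper's structure measure.

```python
import numpy as np

def fit_exponential_decay(trace, lams=None):
    """Fit R(t) = exp(-lam * t) to a retention trace by grid search.

    `trace` holds one value per epoch (for example 0/1 correctness or a
    smoothed version of it); this input format is an assumption.
    Returns (best_lam, r_squared).
    """
    trace = np.asarray(trace, dtype=float)
    t = np.arange(len(trace), dtype=float)
    if lams is None:
        # Hypothetical search range for the decay rate.
        lams = np.linspace(0.0, 2.0, 2001)
    # Predicted curves for every candidate rate, shape (n_lams, n_epochs).
    preds = np.exp(-np.outer(lams, t))
    sse = ((preds - trace) ** 2).sum(axis=1)
    best = int(np.argmin(sse))
    ss_tot = ((trace - trace.mean()) ** 2).sum()
    # R^2 is undefined for a constant trace, so report NaN in that case.
    r2 = 1.0 - sse[best] / ss_tot if ss_tot > 0 else float("nan")
    return float(lams[best]), float(r2)
```

A sample that is never forgotten yields a rate of zero, while larger fitted rates correspond to retention that decays faster over epochs. A grid search is used only to keep the sketch dependency-free; a nonlinear least-squares routine would serve the same purpose.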
