An evolutionary perspective on modes of learning in Transformers
Alexander Y. Ku, Thomas L. Griffiths, Stephanie C.Y. Chan

TL;DR
This paper examines how environmental stability and cue reliability shape whether Transformers favor in-weight learning (IWL) or in-context learning (ICL), and shows that transitions between the two strategies depend on the task.
Contribution
It introduces an evolutionary framework to understand the conditions favoring different learning modes in Transformers, supported by controlled experiments and analysis.
Findings
Stable environments favor in-weight learning (IWL).
Reliable cues promote in-context learning (ICL) in volatile environments.
Transitions between strategies depend on environmental and task-specific factors.
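The first two findings can be illustrated with a toy simulation. This is a minimal sketch under loud assumptions, not the paper's experimental setup: the function `simulate`, the binary environment, and the two hard-coded strategies are all hypothetical. A "fixed" strategy stands in for in-weight learning by committing to the initial state of the environment, and a "cue-following" strategy stands in for in-context learning by trusting a noisy cue at every step.

```python
import random


def simulate(stability, cue_reliability, steps=10_000, seed=0):
    """Toy comparison of a fixed strategy and a cue-following strategy.

    stability: probability that the environment keeps its state at each step.
    cue_reliability: probability that the cue matches the current state.
    Returns (fixed_accuracy, cue_accuracy). All names are illustrative.
    """
    rng = random.Random(seed)
    state = 0        # current correct answer in the environment
    fixed_guess = 0  # IWL-like: committed to the initial state, never updated
    fixed_hits = 0
    cue_hits = 0
    for _ in range(steps):
        # The environment flips with probability 1 - stability.
        if rng.random() > stability:
            state = 1 - state
        # The cue reports the true state with probability cue_reliability.
        cue = state if rng.random() < cue_reliability else 1 - state
        fixed_hits += int(fixed_guess == state)
        cue_hits += int(cue == state)  # ICL-like: always trust the cue
    return fixed_hits / steps, cue_hits / steps


if __name__ == "__main__":
    # Stable environment, weak cue: the fixed strategy should do better.
    print(simulate(stability=1.0, cue_reliability=0.6))
    # Volatile environment, reliable cue: following the cue should do better.
    print(simulate(stability=0.5, cue_reliability=0.95))
```

In this sketch the fixed strategy never adapts, which is a deliberate simplification: real in-weight learning tracks the environment slowly rather than not at all.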
Abstract
The success of Transformers lies in their ability to improve inference through two complementary strategies: the permanent refinement of model parameters via in-weight learning (IWL), and the ephemeral modulation of inferences via in-context learning (ICL), which leverages contextual information maintained in the model's activations. Evolutionary biology tells us that the predictability of the environment across timescales predicts the extent to which analogous strategies should be preferred. Genetic evolution adapts to stable environmental features by gradually modifying the genotype over generations. Conversely, environmental volatility favors plasticity, which enables a single genotype to express different traits within a lifetime, provided there are reliable cues to guide the adaptation. We operationalize these dimensions (environmental stability and cue reliability) in controlled…
Peer Reviews
Decision: ICLR 2026 Poster
The paper comes with a clear, motivating framing that yields testable experimental axes (stability and cue reliability). The scientific claims are substantiated by systematic empirical evaluation, with dense sweeps and temporal analyses across many configurations. Coherent qualitative findings across two tasks: stability favors IWL; reliable cues favor ICL; transience depends on relative difficulty. Efficient experimental design allowed broad exploration and reproducibility in principle; hyp
I might be mistaken, but is there a critical and central inconsistency? S_ICL is defined as S_ICL = E_IWL / (E_ICL + E_IWL + eps) but interpreted (and plotted) as higher S_ICL meaning more ICL. Insufficient robustness: only 3 seeds per configuration; many key effects (thresholds, transience) need more seeds and statistical tests. Limited external validity: only two simplified tasks and a single small Transformer; applicability to larger models and potentially LLMs or naturalistic domains is untested
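The reviewer's question about the ratio can be made concrete with a small calculation. The sketch below assumes only the formula as quoted in the review; this excerpt does not define the two E terms, so the function name `icl_share` and both readings in the comments are hypothetical. The point is that the same number supports opposite conclusions depending on whether E denotes an error or a score.

```python
def icl_share(e_icl, e_iwl, eps=1e-8):
    """Ratio as quoted in the review: E_IWL / (E_ICL + E_IWL + eps).

    What e_icl and e_iwl measure is an assumption here; the excerpt does
    not say whether they are errors or scores.
    """
    return e_iwl / (e_icl + e_iwl + eps)


if __name__ == "__main__":
    value = icl_share(e_icl=0.1, e_iwl=0.9)
    print(value)  # close to 0.9

    # Reading 1 (E is an error): the model fails the IWL evaluation and
    # passes the ICL evaluation, so a high ratio means more ICL.
    # Reading 2 (E is a score, e.g. accuracy): the model does well on the
    # IWL evaluation, so the same high ratio means more IWL.
```

Under the first reading the definition and its interpretation agree; under the second they conflict, which may be the inconsistency the reviewer has in mind.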
S1: The experiments are quite novel and interesting. Although some past studies examine curriculum learning, where the loss landscape keeps changing, studying this together with the question of ICL vs IWL is genuinely interesting. S2: The **quality** of the paper's presentation is great, even though the framing seems questionable (see W1). The paper is well-written and figures are clear. The experimental design is clean and easy to understand. S3: The identification of IWL->ICL transience (in
W1: This is my only major concern with the paper. The connection to evolutionary biology seems to hold only at a very high level. The framing makes sense, but does not actually offer any falsifiable statements or predictions. While this could have been a nice discussion point, I don't see why this should be a main theme of the paper instead of simply framing it as a study of circuit competition on non-stationary losses. Furthermore, the training method lacks any kind of evolutionary mechanism such
The question is interesting. The experiments are informative. The review of the literature (on both evolution and ML) is nice.
There seems to be no fatal flaw in the paper that I can see. It may be argued that some of the results are not really earth-shattering ("more stability = more overfitting?"), though the additional experiments in Figures 4 and 5 provide more details on the dynamics. It's not clear how much the "relative cost" hypothesis helps, because there seems to be no precise definition of "cost", except for a-posteriori hardness on IWL? (I note that the parameter used to tune the Omniglot task is the total num
Taxonomy
Topics: Complex Systems and Decision Making
