Generalization vs. Memorization in Autoregressive Deep Learning: Or, Examining Temporal Decay of Gradient Coherence

James Amarel; Nicolas Hengartner; Robyn Miller; Kamaljeet Singh; Siddharth Mansingh; Arvind Mohan; Benjamin Migliori; Emily Casleton; Alexei Skurikhin; Earl Lawrence; Gerd J. Kunde

arXiv:2509.00024 · physics.comp-ph · January 21, 2026

TL;DR

This paper investigates the balance between generalization and memorization in autoregressive deep learning models for PDE surrogates, revealing limitations and guiding improved model design for scientific discovery.

Contribution

It introduces an influence function-based framework to analyze how these models assimilate information, exposing fundamental limitations and offering insights for better surrogate design.

Findings

01. Standard models show limited generalization beyond training data

02. Influence functions reveal how information propagates in models

03. Insights lead to improved surrogate training strategies

Abstract

Foundation models trained as autoregressive PDE surrogates hold significant promise for accelerating scientific discovery through their capacity to both extrapolate beyond training regimes and efficiently adapt to downstream tasks despite a paucity of examples for fine-tuning. However, reliably achieving genuine generalization - a necessary capability for producing novel scientific insights and robustly performing during deployment - remains a critical challenge. Establishing whether or not these requirements are met demands evaluation metrics capable of clearly distinguishing genuine model generalization from mere memorization. We apply the influence function formalism to systematically characterize how autoregressive PDE surrogates assimilate and propagate information derived from diverse physical scenarios, revealing fundamental limitations of standard models and training routines…

Equations

$$S(\delta\theta)=dC[\delta\theta]+\frac{1}{2}\left\lVert\hat{y}(\theta)-\hat{y}(\theta+\delta\theta)\right\rVert_{L_{2}}^{2},$$

$$\delta\theta^{\mu}=-\eta^{\mu\nu}\partial_{\nu}C,$$

$$\mathcal{L}_{V}Q=-\left(\frac{\delta Q}{\delta\hat{y}^{n}},\,\Pi^{nm}\frac{\delta C}{\delta\hat{y}^{m}}\right),$$

$$\Pi^{nm}=J_{\mu}^{n}\eta^{\mu\nu}J_{\nu}^{m},$$

$r_{C}$, $r_{M}$, $r_{E}$

$$H_{AB}(t,n\mid\tau,m)=\left(r_{A}^{nt},\,\Pi_{t\tau}^{nm}r_{B}^{m\tau}\right),$$

$$s_{t+1}=U[s_{t}],$$

$$C_{\text{SMSE}}(\theta)=\frac{1}{N}\sum_{n=1}^{N}\frac{1}{4}\sum_{c}\frac{\left\lVert\hat{y}_{\theta}^{c}(s_{t_{n}})-s_{t_{n}+1}^{c}\right\rVert_{L_{2}}^{2}}{\mathrm{RMS}(s_{t_{n}+1}^{c})},$$

$$\delta C=\frac{\delta C}{\delta y}\delta y,$$

$$\delta\hat{y}^{n}=\mathcal{L}_{V}\hat{y}^{n}=-\Pi^{nl}\frac{\delta^{2}C}{\delta\hat{y}^{l}\,\delta y^{m}}\delta y^{m};$$

$$\eta_{\mu\nu}\to\eta_{\mu\nu}+\lambda\delta_{\nu\mu},$$

$$\partial_{t}\rho_{c}+\nabla\cdot J_{c}=0,$$

$$J_{\text{mass}}^{j}=\rho_{\text{mass}}v^{j}$$

$$J_{\text{mom}}^{ij}=\rho_{\text{mass}}v^{i}v^{j}+p\,\delta^{ij}$$

$$J_{\text{energy}}^{j}=(\rho_{\text{energy}}+p)\,v^{j}$$

$$\rho_{\text{energy}}=\rho_{\text{mass}}e+\frac{1}{2}\rho_{\text{mass}}\lvert v\rvert^{2},$$

$$p=(\gamma-1)\rho_{\text{mass}}e,$$

$$\frac{d}{dt}\int_{\Omega}\rho_{c}\,dA+\underbrace{\int_{\partial\Omega}J_{c}\cdot n\,dS}_{\text{vanishes for periodic }\Omega}=0.$$

$$Q_{c}(t)=\int_{\Omega}\rho_{c}\,dA$$

Taxonomy

Topics: Machine Learning in Materials Science · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning

Full text

Abstract

Foundation models trained as autoregressive PDE emulators hold significant promise for accelerating scientific discovery through their capacity to both extrapolate beyond training regimes and efficiently adapt to downstream tasks despite a paucity of examples for fine-tuning. However, reliably achieving genuine generalization—a necessary capability for producing novel scientific insights and robustly performing during deployment—remains a critical challenge. Establishing whether these requirements are met demands evaluation metrics capable of clearly distinguishing genuine generalization from mere memorization. We apply the influence function formalism to systematically characterize how autoregressive PDE emulators assimilate and propagate information derived from diverse physical scenarios, revealing fundamental limitations of standard models and training routines in addition to providing actionable insights regarding the design of improved surrogates.

Machine Learning, ICML

1 Introduction

Machine learning surrogate modeling has emerged as a powerful technique for efficiently approximating the solutions of computationally intensive partial differential equations (PDEs). These emulation methods range from purely data-driven approaches, trained on high-fidelity simulation data, to physics-informed neural networks, which integrate PDE structures directly into the training loss, enforcing physical laws as soft constraints (Raissi et al., 2019). Such models hold promise for achieving significant computational acceleration in applications such as fluid dynamics (Takamoto et al., 2024; Lippe et al., 2023; Gupta and Brandstetter, 2022; Ohana et al., 2025; Herde et al., 2024), climate modeling (Bodnar et al., 2024), and materials science (Batatia et al., 2024), enabling rapid research and development across diverse scientific and engineering disciplines. Despite these advances, reliable generalization and robustness remain a critical challenge (Krishnapriyan et al., 2021). Before surrogate models can be safely deployed in operational environments demanding generalization beyond their training data, it is essential to develop methods capable of quantifying risk profiles and ensuring trustworthiness of predictions.

Distinguishing between memorization of training examples and genuine generalization is critical to evaluating model robustness; diagnostic tools such as influence functions (Koh and Liang, 2020; Bae et al., 2022), leverage scores, and gradient alignment analyses offer promising avenues for characterizing this balance, revealing whether models rely appropriately on generalized understanding or disproportionately on memorized patterns (Fort et al., 2020; Chatterjee, 2020; Chatterjee and Zielinski, 2022; Zielinski et al., 2020).

Autoregressive models, useful for their promising extrapolation capabilities, accumulate errors during inference, in part due to the inevitable distribution shift that originates with the usage of model outputs as inputs to drive the predicted evolution arbitrarily far into the future (Lee, May 1, 2023; Brandstetter et al., 2023). While such inference-time error accumulation is significant, we reveal a decisive learning limitation that also contributes to the difficulty of achieving stable long-term rollouts: gradient signals fail to propagate coherently across time, which implies that ordinary training lacks a mechanism for generalizing from supervision of one-step predictions to multi-frame dynamical evolution governed by a shared update structure. The length of time that a model performs (and is confident) prior to excessive prediction error defines a “trust horizon” for forward prediction that is contingent on its encoding of the true data‑generating mechanisms—the physics—rather than merely exploiting empirical correlations among proximal points in feature space to perform local statistical interpolation.

Traditional benchmarks, such as point-wise mean squared error evaluations on limited validation datasets, often fail to adequately capture surrogate model reliability, especially when faced with variations in initial or boundary conditions, mesh resolutions, or varying physical parameter regimes (Setinek et al., 2025). Physics-informed metrics, including conservation-law violation assessments, PDE residual norms, analytical-limit checks, and numerical stability evaluations, have been proposed to better reflect model robustness (Karniadakis et al., 2021), yet even these enriched criteria are not guaranteed to fully quantify the true worst-case prediction errors. Indeed, empirical accuracy metrics based on finitely many examples can dramatically underestimate the true worst‐case error, especially when the data is noisy, sparse, or incompletely understood (Vapnik, 1998).

In scientific machine learning, limited availability of high-fidelity simulation data often results in narrow training distributions, making it challenging to develop robust emulators. On queries poorly represented by the training set, data-driven predictive models risk producing non-physical artifacts, such as violations of conservation laws, causality, or symmetry. While transfer learning and multi-fidelity methods have emerged to alleviate data scarcity, ensuring physically consistent generalization remains a significant challenge (Herde et al., 2024). Towards addressing this gap, current research increasingly emphasizes the development of PDE foundation models designed to achieve robust and unified generalization across diverse physical scenarios (Sun et al., 2025; Ye et al., 2024; Herde et al., 2024; Subramanian et al., 2023). Contemporary PDE emulators employ a variety of architectures (Li et al., 2021; Lu et al., 2021; Gregory et al., 2024; Shankar et al., 2023); however, most large-scale deployments rely on UNet (Ronneberger et al., 2015) or Transformer backbones (Vaswani et al., 2023; Liu et al., 2021; Dosovitskiy et al., 2021), and there remains no consensus on which model variant is most capable at scale. One must balance ease of optimization with the incorporation of physics priors, but quantitative tools for comparing loss-landscape properties across these architectures remain under-explored.

Insight into surrogate model behavior beyond static accuracy metrics can be gained through analysis of the model gradients. Combining test example error evaluation with gradient examination allows for interpolation of prediction errors across the underlying data manifold; for instance, PINNs can be certified with continuous-domain error bounds (Eiras et al., 2024). By quantifying gradient overlap among different training examples, it is possible to identify potential conflicts or synergies present during learning and inherent to fully trained models. Precisely how gradients derived from individual training samples propagate through model parameters is formalized through the use of influence functions (Hampel, 1974; Cook and Weisberg, 1982). Influence functions were originally developed in robust statistics (Huber and Ronchetti, 2009) to quantify how small perturbations of a data point in the training set affect model parameter estimates (Koh and Liang, 2020; Bae et al., 2022). Diagonal elements of the influence function measure each training example’s self-leverage; high-leverage points thereby identify data that exert disproportionate impact during training.

For PDE surrogates, the influence framework can also pinpoint examples providing gradient signals that exacerbate violations of physical constraints (Naujoks et al., 2024). Furthermore, influence functions reveal spatio-temporal correlations inherent in PDE emulator learning (Wang et al., 2025), distinguishing between memorization and generalization in cases where the underlying solution operator lacks explicit space-time dependence, in addition to exposing gradient misalignments across distinct initial conditions and inputs that are well separated in feature space. When applied to PDE foundation models, these techniques systematically characterize model stability, generalization capability, and uncertainty under structured domain shifts and multi-physics scenarios, in addition to uncovering subtle failure modes typically missed by conventional evaluation metrics, thereby enabling targeted refinements of model, architecture, and training routines that yield more robust, physically-consistent, data-driven models (Ren et al., 2019; Zhang and Pfister, 2021).

2 Related Work

Influence functions are powerful tools for understanding model behavior and data importance (Koh and Liang, 2020; Bae et al., 2022). Robust and interpretable criteria for detecting anomalous inputs follow from techniques that analyze the alignment of gradients (Wang et al., 2025) by quantifying directional consistency with in-distribution data (Huang et al., 2021), that employ orthogonal projection (Behpour et al., 2023) to isolate anomalous components, or that perform outlier gradient analysis (Chhabra et al., 2025).

Fort et al. (2020) define stiffness in terms of the dot product between the loss gradients of two inputs. A positive stiffness then means that a stochastic gradient descent (SGD) step benefiting one example simultaneously lowers the loss of the other, which is evidence that the network has assimilated shared, transferable features. Two summary statistics, sign-stiffness and cosine-stiffness, emphasize inter-class and intra-class correlations, respectively. Plotting stiffness against input-space distance yields a dynamic correlation length (the distance at which average stiffness first crosses zero), which shrinks over epochs, revealing how the learned function becomes progressively more localized as specialization sets in.
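The two stiffness statistics can be illustrated on a toy problem. The following is a minimal sketch, assuming a linear model with squared-error loss; the helper names `loss_grad` and `stiffness` and the three labelled inputs are hypothetical, and the cited work averages such pairwise quantities over many input pairs rather than reporting a single pair.

```python
import numpy as np

def loss_grad(w, x, y):
    """Gradient of the squared error 0.5 * (w @ x - y)**2 with respect to w."""
    return (w @ x - y) * x

def stiffness(g1, g2):
    """Return (sign stiffness, cosine stiffness) for two loss gradients."""
    dot = float(g1 @ g2)
    norm = float(np.linalg.norm(g1) * np.linalg.norm(g2))
    cosine = dot / norm if norm > 0 else 0.0
    return float(np.sign(dot)), cosine

# Hypothetical toy setup: a linear model and three labelled inputs.
w = np.array([0.5, -1.0])
x_a, y_a = np.array([1.0, 0.0]), 2.0
x_b, y_b = np.array([2.0, 0.0]), 4.0   # parallel to x_a
x_c, y_c = np.array([0.0, 1.0]), 1.0   # orthogonal to x_a

g_a = loss_grad(w, x_a, y_a)
g_b = loss_grad(w, x_b, y_b)
g_c = loss_grad(w, x_c, y_c)

sign_ab, cos_ab = stiffness(g_a, g_b)  # aligned gradients: a step on a also helps b
sign_ac, cos_ac = stiffness(g_a, g_c)  # orthogonal gradients: no transfer either way
```

In this toy case the pair (a, b) is maximally stiff, whereas the pair (a, c) has zero stiffness because the two gradients touch disjoint parameters.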

The Coherent Gradients Hypothesis (Chatterjee, 2020) proposed that per-example gradients tend to align for similar inputs, so SGD steps amplify directions supported by many examples while suppressing idiosyncratic ones, steering the network toward functions that generalize rather than memorize. Extensions of the Coherent Gradients Hypothesis (Zielinski et al., 2020) posit that SGD updates aligned across multiple training examples (strong directions) underpin generalization, whereas idiosyncratic updates (weak directions) promote memorization. They introduce optimizers that suppress weak directions without computing per-example gradients, dramatically reducing the train-test gap, even in the presence of heavy label noise, and thereby offer the first large-scale confirmation of the hypothesis. Complementing this view, He and Su (2020) establish the notion of local elasticity: in some neural networks, a parameter update perturbs predictions only within a narrow neighbourhood around the training point.

PINNfluence (Mlodozeniec et al., 2025) interrogates a trained physics-informed neural network under perturbations to the PDE parameters and reweighting of collocation points. They distill raw pointwise influences into physically meaningful diagnostics such as the directional indicator, which measures the fraction of influence that propagates downstream with the fluid flow.

3 Our Contributions

We make three key advances toward principled analysis and validation of PDE emulators:

Time-Aware Analysis of Off-Diagonal Influence Function Elements: A systematic study of off-diagonal influence function elements for PDE surrogate models, capable of quantifying training-sample leverage across physical time [see Figure 1]. This diagnostic sets standards for identifying the learning of persistent nontrivial correlations that extend across temporal horizons, thereby identifying when the emulator network has internalized fundamental, time-invariant PDE structures.

Gradient-Coherence Diagnostics Across Initial Condition Classes: We determine the degree of alignment of gradients computed across different classes of PDE solutions for two standard architectures, a UNet and a ViT. Strong alignment signals the learning of robust, transferable physics, whereas weak alignment suggests that the neural network embeds these classes on separated regions of the input manifold, with limited gradient coherence, despite the fact that the data represents solutions to the same underlying PDE.

Dynamic Correlation Length and Curvature Diagnostics: We show that autoregressive PDE emulators generically exhibit a limited dynamic correlation length (Fort et al., 2020), directly observable through the rapid decay of influence with increasing feature-space distance. Such feature-space localization provides an explicit, training-time explanation for why such models fail to reuse dynamical structure or internalize shared physical laws beyond narrow neighborhoods of the data manifold. Complementing this evidence, spectral analyses of the neural tangent kernel metric show that low test error is typically achieved in a highly anisotropic regime: while most directions remain flat, a small number of dominant eigenmodes exhibit large curvature, corresponding to sharp, high-sensitivity directions rather than globally robust solutions. This spectral imbalance clarifies why apparent interpolation success does not imply robustness, and why learned dynamics fail to transfer coherently across time or conditions despite favorable one-step performance (Karakida et al., 2019; Anonymous, 2025).

This paper is organized as follows. For the readers’ convenience, we first present our central results in section 4, exposing a pronounced lack of generalization capabilities in autoregressive PDE emulators. Technical details, covering both the mathematical formulation of the influence-function framework and our training procedures, are provided in section 5 and section 6, respectively.

4 Results

We examine how training information propagates across time and initial-condition classes in autoregressive PDE emulators, using influence-based diagnostics evaluated on held-out test data. Across architectures, physical observables, and datasets, we find that gradient responses are strongly localized in both time and class, with off-diagonal influence rapidly decaying, indicating that these models primarily learn time- and class-indexed update rules rather than a globally consistent dynamical operator.

Test-data measurements of the two-time influence function [see Equation 6] for both a UNet and a ViT tasked with emulating fluid flow exhibit rapid temporal decay in the off-diagonal terms [see Figure 2], which indicates that surrogate training constructs localized vector fields suitable only for interpolation within small neighborhoods of the training data sub-manifold, rather than the universally consistent function that is desired based on expectations stemming from our knowledge of the underlying governing equations. If such models were truly learning the solution operator to a PDE that lacks explicit time dependence, gradients derived from examples at a given time would necessarily have a profound effect on the predictions at any other time, for we know that the true solution operator must be time-translation equivariant, taking the same functional form at every point in phase space. This superfluous time awareness presents across the entire training trajectory, demonstrating that our models did not learn the underlying solution operator.

Consistent with the two-time influence maps, the class-to-class transferability matrix in Figure 3 is strongly diagonal, indicating that gradient geometry is effectively class-locked: updates supported by one initial-condition family produce negligible response in the others. Furthermore, there is a near-total absence of inter-class influence [see Figure 4].

The degree of gradient alignment across examples also affords conclusions about the data manifold sparsity: while all inputs to the network are intimately related as unique solutions to a shared equation of motion under different initial conditions, both our ViTs and our UNets render inputs well separated in the sense that their gradients do not meaningfully overlap unless their feature-space distance is small [see Figure 5], which implies limited generalization over dynamical structure away from nearby states.

Such results challenge a fundamental assumption motivating the development of PDE foundation models, as they demonstrate that these models are prone to effectively treating different flow fields as distinct, isolated learning tasks. That this happens even when said classes of solutions arise merely from different initial conditions to the same physical process underscores the need for inductive biases to be explicitly incorporated during model development.

In addition to the overlap of cost function gradients, we also considered gradients derived from physics-informed loss functions, such as global mass conservation [see Figure 6] and global energy conservation. In all cases, we observed that the response function decayed off the time-diagonal and was dominated by intra-class matrix elements [see Figure 7]. Hence, we conclude that predictive models lacking explicit inductive biases are not internalizing a unified governing law, but merely allocating parameters tasked specifically with evolving states associated with a given time along a given class of trajectories.

Lastly, Figure 8 shows that the dominant NTK eigenmodes are large, revealing a stiff, highly anisotropic local response geometry; in particular, low test error coexists with sharp high-curvature modes rather than a uniformly flat, robust geometry.
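The kind of spectral summary described here can be sketched on synthetic data. The example below is illustrative only: the Jacobian is random with two artificially amplified columns (an assumption made to mimic anisotropy), and the names `top_fraction` and `condition_number` are hypothetical summary statistics, not quantities reported in the text.

```python
import numpy as np

rng = np.random.default_rng(5)
n_out, n_par = 40, 10                      # hypothetical toy sizes

# Toy Jacobian with two strongly amplified parameter directions.
scales = np.array([30.0, 10.0] + [0.1] * (n_par - 2))
J = rng.normal(size=(n_out, n_par)) * scales

eta = J.T @ J                              # NTK metric in parameter space
eigvals = np.linalg.eigvalsh(eta)[::-1]    # eigenvalues in descending order

# Fraction of the total spectral weight carried by the two largest modes.
top_fraction = eigvals[:2].sum() / eigvals.sum()
# Ratio of the stiffest to the flattest direction.
condition_number = eigvals[0] / eigvals[-1]
```

A `top_fraction` close to one together with a large `condition_number` is the signature of the anisotropic regime: a few stiff directions and many nearly flat ones.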

5 Proximal Response Function

We develop the influence function as follows, taking inspiration from Bae et al. (2022). Let $\theta$ be the current parameter values and consider the optimization step $\theta\leftarrow\theta+\delta\theta$, where the tangent-space displacement $\delta\theta$ minimizes the proximal objective

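$$S(\delta\theta)=dC[\delta\theta]+\frac{1}{2}\left\lVert\hat{y}(\theta)-\hat{y}(\theta+\delta\theta)\right\rVert_{L_{2}}^{2},\tag{1}$$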

with $\hat{y}$ a neural network. The stationarity condition $dS\stackrel{!}{=}0$ is satisfied (to linear order) by

$$\delta\theta^{\mu}=-\eta^{\mu\nu}\partial_{\nu}C,\tag{2}$$

where $\eta_{\mu\nu}=J^{n}_{\mu}J^{n}_{\nu}$ is the neural tangent kernel metric, $J_{\mu}^{n}=\partial_{\mu}\hat{y}^{n}$ is the model Jacobian, and $n$ indexes a given mini-batch example. By convention, the components of $\eta$ carry lowered indices, $\eta_{\mu\nu}$, while those of $\eta^{-1}$ carry raised indices, $\eta^{\mu\nu}$, i.e., $\eta^{\mu\alpha}\eta_{\alpha\nu}=\delta^{\mu}_{\ \nu}$; $\eta$ provides the canonical correspondence between covariant and contravariant components (Absil et al., 2008). Equation 1 balances the force term $dC$ against the kinetic cost of the update distance in the $\eta$-prescribed geometry. Convexity of $\eta$, together with mild regularity requirements on $C$, guarantees a unique stationary point of each proximal subproblem. Proximal gradient descent iterates these subproblems to accumulate a sequence of locally improving displacements that drives descent of the cost function $C$. The inverse susceptibility tensor $\eta^{-1}$ serves as a generalized stiffness operator by propagating gradient signals to parameter displacements (Fort et al., 2020).
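The passage from the proximal objective to the update rule can be made explicit. The derivation below is a sketch under two assumptions that are not spelled out in the text: the network is linearized about $\theta$, so that $\hat{y}^{n}(\theta+\delta\theta)\approx\hat{y}^{n}(\theta)+J^{n}_{\mu}\delta\theta^{\mu}$, and $\eta$ is invertible (in practice after damping).

```latex
\begin{aligned}
S(\delta\theta)
  &= dC[\delta\theta]
   + \tfrac{1}{2}\,\bigl\lVert \hat{y}(\theta)-\hat{y}(\theta+\delta\theta)\bigr\rVert_{L_{2}}^{2} \\
  &\approx \partial_{\mu}C\,\delta\theta^{\mu}
   + \tfrac{1}{2}\,J^{n}_{\mu}J^{n}_{\nu}\,\delta\theta^{\mu}\delta\theta^{\nu}
   = \partial_{\mu}C\,\delta\theta^{\mu}
   + \tfrac{1}{2}\,\eta_{\mu\nu}\,\delta\theta^{\mu}\delta\theta^{\nu}, \\[4pt]
\frac{\partial S}{\partial\,\delta\theta^{\mu}}
  &= \partial_{\mu}C + \eta_{\mu\nu}\,\delta\theta^{\nu} = 0
  \quad\Longrightarrow\quad
  \delta\theta^{\mu} = -\,\eta^{\mu\nu}\,\partial_{\nu}C .
\end{aligned}
```

The quadratic term is what makes $\eta$ act as a metric: the update is the steepest-descent direction measured by how much it moves the network outputs, not by how far it moves the parameters.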

Classical influence functions can be expressed as the Lie derivative of a scalar; they are capable of probing local gradient coherence, generalization capabilities, and adversarial sensitivity, in addition to enabling the identification of high-leverage examples. Consider a scalar observable $Q$ and a vector field $V=-\eta^{-1}(dC)$ derived from the proximal objective. The Lie derivative of $Q$ along $V$ is

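$$\mathcal{L}_{V}Q=-\left(\frac{\delta Q}{\delta\hat{y}^{n}},\,\Pi^{nm}\frac{\delta C}{\delta\hat{y}^{m}}\right),\tag{3}$$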

where

$$\Pi^{nm}=J_{\mu}^{n}\eta^{\mu\nu}J_{\nu}^{m},\tag{4}$$

and the inner product $\left(\cdot,\cdot\right)$ is performed over feature indices. Hence, when $Q$ is a loss function, Equation 3 reduces to the familiar form of an influence function in deep learning: the infinitesimal response of the loss, expressible as a metric-weighted gradient overlap. Likewise, when $Q$ denotes a model response and $V$ encodes the perturbation to the gradient signal induced by a deformation of the input, Equation 3 reproduces the classical influence-function expression from robust statistics [see Appendix A].
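The equivalence between the parameter-space and output-space forms of this response can be checked numerically. The sketch below assumes a toy linear model $\hat{y}=J\theta$ with quadratic cost and observable, and a small damping `lam` added for invertibility; all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_par = 6, 3                       # hypothetical toy sizes
J = rng.normal(size=(n_out, n_par))       # Jacobian of the linear model y_hat = J @ theta
theta = rng.normal(size=n_par)
y_train = rng.normal(size=n_out)
y_test = rng.normal(size=n_out)

lam = 1e-3                                # small damping, an assumption for invertibility
eta = J.T @ J + lam * np.eye(n_par)       # damped NTK metric in parameter space
eta_inv = np.linalg.inv(eta)

y_hat = J @ theta
dC_dy = y_hat - y_train                   # dC/dy_hat for C = 0.5 * ||y_hat - y_train||^2
dQ_dy = y_hat - y_test                    # dQ/dy_hat for Q = 0.5 * ||y_hat - y_test||^2

# Parameter-space form: L_V Q = dQ(V) with V = -eta^{-1} dC.
V = -eta_inv @ (J.T @ dC_dy)
lie_param = (J.T @ dQ_dy) @ V

# Output-space form: L_V Q = -(dQ/dy_hat, Pi dC/dy_hat) with Pi = J eta^{-1} J^T.
Pi = J @ eta_inv @ J.T
lie_output = -dQ_dy @ Pi @ dC_dy

# Finite-difference check: Q(theta + eps * V) - Q(theta) is approximately eps * L_V Q.
def Q(th):
    return 0.5 * np.sum((J @ th - y_test) ** 2)

eps = 1e-6
lie_fd = (Q(theta + eps * V) - Q(theta)) / eps
```

The three numbers `lie_param`, `lie_output`, and `lie_fd` agree up to discretization error, which is the content of the metric-weighted gradient overlap interpretation.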

Evidently, the Lie-derivative formulation of response is well defined at any point along the training trajectory, as it depends only on the instantaneous training-flow vector field and the induced local geometry; hence, influence-function analysis of neural networks does not require attainment of a stationary point to expand about. This perspective elevates influence from a static sensitivity relevant only near convergence to a dynamical linear response observable defined throughout optimization.

In the limit of vanishing regularization, $\lambda\rightarrow 0$, where $\lambda$ is the damping introduced through $\eta_{\mu\nu}\to\eta_{\mu\nu}+\lambda\delta_{\nu\mu}$, $\Pi$ becomes idempotent, assuming the form of a classical hat matrix. We thus take the view that diagonal elements reflect statistical leverage, quantifying the self-influence of individual training examples, while off-diagonals measure cross-influence, i.e., influence between distinct examples. High leverage scores identify regions of parameter space with strong local curvature or limited redundancy, i.e., points with disproportionately large influence on the global response structure. Furthermore, the response matrix encodes the pairwise overlap of example gradients, effectively probing the local loss landscape by highlighting directions of correlated curvature and shared descent paths. Physical considerations that guide expectations for the structure of $\Pi$ are evident on recognizing that we have so far suppressed feature indices in the model Jacobian. The response matrix $\Pi^{nm}$ tracks how gradients derived from each output feature of prediction $m$ influence each output feature of prediction $n$, offering an investigative level of detail across spacetime, channel, and class dimensions that remains unexplored. We emphasize that the proximal penalty in Equation 1 sets the geometry of the update and clearly identifies $\Pi$ as the primary object governing to what extent an infinitesimal perturbation in the cost function propagates to an observable, such as the test error or physical consistency of predictions. To avoid materializing $\Pi$, which has $(128\times 128\times 4\times 48)^{2}\approx 9.9\times 10^{12}$ elements, we consider macroscopic observables: SMSE, in addition to global mass and energy conservation.
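The hat-matrix limit can be illustrated with a small random Jacobian. This is a minimal sketch assuming more outputs than parameters and a full-column-rank Jacobian, so that $\Pi$ is a genuine projector at $\lambda=0$; the helper `response_matrix` and the toy sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_out, n_par = 8, 3                       # hypothetical toy sizes
J = rng.normal(size=(n_out, n_par))       # full column rank with probability one

def response_matrix(J, lam):
    """Pi = J (J^T J + lam * I)^{-1} J^T."""
    eta = J.T @ J + lam * np.eye(J.shape[1])
    return J @ np.linalg.solve(eta, J.T)

Pi0 = response_matrix(J, 0.0)             # vanishing regularization
Pi_damped = response_matrix(J, 0.5)       # finite regularization

idempotent_at_zero = np.allclose(Pi0 @ Pi0, Pi0)
idempotent_damped = np.allclose(Pi_damped @ Pi_damped, Pi_damped)

leverage = np.diag(Pi0)                   # self-influence of each output
total_leverage = np.trace(Pi0)            # equals the number of parameters here
```

At zero damping the leverages lie between zero and one and sum to the number of fitted parameters, exactly as for the hat matrix of linear regression; finite damping shrinks the eigenvalues of $\Pi$ below one and breaks idempotency.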

5.1 Observables

Our probe of generalization capabilities proceeds by quantifying the coherence of gradients derived from test data cost functions of physical and statistical significance. We introduce three generalized residuals

[TABLE]

where $M$ ($E$) computes the total mass (energy) of its argument, $x$ is the input state that evolves to $y$, i.e., $y=U[x]$ with $U$ defined in Equation 7, and $\hat{y}$ is the neural network approximation to $y$. Recall that each training example is a pair $(x,y)=(s_{t}^{n},\,s_{t+1}^{n})$ of states $s$ sharing a common initial configuration indexed by $n$ and related by the compressible Euler evolution operator.
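The totals $M$ and $E$ are domain integrals of the corresponding density channels, which on a uniform grid reduce to sums. The sketch below assumes a channel ordering (mass, x-momentum, y-momentum, energy) and a unit-square periodic domain; the function `conservation_residual` is one possible form of a conservation residual chosen for illustration, not necessarily the definition used by the authors.

```python
import numpy as np

def total_charges(state, cell_area):
    """Integrate each density channel over the grid: Q_c = sum(rho_c) * dA."""
    return state.sum(axis=(-2, -1)) * cell_area

def conservation_residual(pred, inp, channel, cell_area):
    """Difference between predicted and input totals for one channel (an assumed form)."""
    return total_charges(pred, cell_area)[channel] - total_charges(inp, cell_area)[channel]

rng = np.random.default_rng(2)
# Hypothetical state with channels (mass, x-momentum, y-momentum, energy) on a 128 x 128 grid.
state = rng.uniform(0.5, 1.5, size=(4, 128, 128))
cell_area = (1.0 / 128) ** 2              # unit-square domain, an assumption

Q_before = total_charges(state, cell_area)

# A rigid periodic shift rearranges the fields without changing any domain integral.
shifted = np.roll(state, shift=(5, -3), axis=(-2, -1))
Q_after = total_charges(shifted, cell_area)

mass_residual = conservation_residual(shifted, state, 0, cell_area)
```

A surrogate that respects the conservation laws on a periodic domain would leave these totals unchanged from one step to the next, so a nonzero residual directly measures the violation.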

Viewed as a coupling matrix over residuals, the diagonal blocks of $\Pi$ recover the usual influence (e.g., how a perturbation in the SMSE affects SMSE itself), while the off-diagonal blocks encode cross-coupling, quantifying how a change in the SMSE residual at one time step or sample is converted into the conservation residual at another time step or sample, and vice versa.

It is useful to introduce the following notation for the remaining external indices of the response matrix

$$H_{AB}(t,n\mid\tau,m)=\left(r_{A}^{nt},\,\Pi_{t\tau}^{nm}r_{B}^{m\tau}\right),\tag{6}$$

where $n,m$ index trajectories, defined by distinct initial conditions; indices $t$ and $\tau$ specify the time step along said trajectories. $H_{CC}$ gives the change in SMSE due to an SMSE perturbation, while $H_{MC}$ and $H_{EC}$ propagate the effect of gradients derived from SMSE into the physics-informed mass and energy conservation errors, respectively.
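A two-time influence matrix of this form can be assembled without materializing $\Pi$, by first pulling each residual back to parameter space. The sketch below uses a single hypothetical trajectory, random Jacobians and residuals, and a damping `lam`; the cosine-like normalization at the end is chosen here for illustration and is not necessarily the normalization adopted by the authors.

```python
import numpy as np

rng = np.random.default_rng(3)
n_times, n_feat, n_par = 5, 4, 6          # hypothetical toy sizes

# J[t] is the Jacobian of the prediction at time step t with respect to shared parameters.
J = rng.normal(size=(n_times, n_feat, n_par))
# r[t] stands in for an output-space residual gradient at time step t.
r = rng.normal(size=(n_times, n_feat))

lam = 1e-2                                # damping, an assumption for invertibility
J_flat = J.reshape(-1, n_par)
eta = J_flat.T @ J_flat + lam * np.eye(n_par)

# Pull each residual back to parameter space: g[t] = J[t]^T r[t].
g = np.einsum("tfp,tf->tp", J, r)

# H[t, tau] = (r_t, Pi_{t tau} r_tau) = g_t^T eta^{-1} g_tau.
H = g @ np.linalg.solve(eta, g.T)

# Normalizing by the diagonal gives a cosine-like coherence between time steps.
d = np.sqrt(np.diag(H))
coherence = H / np.outer(d, d)
```

Only parameter-sized vectors and one parameter-by-parameter solve are needed, which is the practical reason for working with macroscopic observables. Rapid decay of `coherence` away from the diagonal would correspond to the temporal localization discussed in the results.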

We report influence in a standardized form by normalizing with respect to the empirical variance of perturbations within each mini-batch, so that the baseline model, corresponding to unstructured stochastic variability, naturally sets the reference scale to unity. In this normalization, departures from one directly indicate influence beyond what is expected from random mini-batch fluctuations, providing a principled scale for interpreting both amplified self-responses and suppressed cross-responses (Héritier and Ronchetti, 1994; Lu et al., 1997). Matrix elements of $\Pi$ were determined for six different mini-batches, each of which contained three trajectories corresponding to distinct initial conditions, for each seed of each model architecture trained, across two datasets [see Appendix B].

6 Data and Training

We trained neural network surrogate models to approximate the evolution of two-dimensional compressible Euler flows, provided by the PDEGym dataset (Herde et al., 2024). Specifically, we used a dataset that contains three classes of initial conditions, namely, the four quadrant Riemann problem with (CE-RPUI) and without (CE-RP) uncertain interfaces, in addition to the curved Riemann problem (CE-CRP). This data is particularly valuable for studying the progression from a linear wave regime with discontinuities to fully developed turbulence, a crossover that poses computational and analytical challenges due to the presence of sharp wave-fronts and emergent nonlinear interactions. While the CE flows exhibit comparable large-scale structures, they also display qualitatively distinct behaviors. In particular, CE-RPUI initial configurations give rise to complex finger-like instabilities in the flow field that are absent or less pronounced in both CE-RP and CE-CRP. In total, we used $6,500$ trajectories for each of the three classes of initial conditions; for each trajectory, we used the first $16$ time steps, for a total of approximately $110,000$ training pairs requiring more than $150$ GB of memory.

Since instantaneous flow states alone cannot distinguish viscous Navier-Stokes flows from their inviscid Euler counterparts, we do not combine compressible Euler data with Navier-Stokes data. Furthermore, rather than representing a fluid state using the velocities and pressure in addition to density, as was done by Poseidon (Herde et al., 2024), we used the momentum and energy fields; we expect that this setup will better facilitate the learning of all four conservation laws.

Each snapshot of the flow state at discrete time $t$ is represented as a set of spatially discretized fields $\rho_{\text{mass}},\rho^{i}_{\text{mom}},\rho_{\text{energy}}$ on a uniform grid of size $128\times 128$, where $\rho_{\text{mass}}$ denotes mass density, $\rho^{i}_{\text{mom}}$ are the Cartesian components of momentum density, and $\rho_{\text{energy}}$ is energy density. The model $\hat{y}_{\theta}$ is trained to emulate the compressible Euler evolution, i.e., $\hat{y}_{\theta}\approx U$, where the operator $U$ enacts

$$s_{t+1}=U[s_{t}],\tag{7}$$

with $s_{t}$ the collection of state variables at timestep $t$, via optimization of the weights $\theta$. Specifically, we used the Adam optimizer (Kingma and Ba, 2017) with learning rate $5\times 10^{-4}$ and weight decay $\lambda=10^{-4}$ to minimize a scaled mean squared error (SMSE) between predicted and true states

[TABLE]

on mini-batches containing $N=48$ transitions $s_{t}\to s_{t+1}$, chosen randomly from the training set. Here, the $L_{2}$ norm is computed over the spatial degrees of freedom, and the channel index $c$ runs through mass, both Cartesian components of the momentum, and energy. $\mathrm{RMS}(s^{c})$ is computed channel-wise by taking the spatial root-mean-square of $s^{c}$. Thus, Equation 8 strikes a balance between ordinary and relative mean-squared error: normalizing each channel's squared error by the target field's characteristic amplitude favors examples containing pronounced features, but does not completely drown out gradients derived from relatively quiescent flows. This facilitates accurate capture of high-energy shocks and wavefronts without harshly under-emphasizing small-amplitude features. Moreover, the scaling of Equation 8 renders dimensionless the matrix elements of interest to this work.
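As a rough illustration of the scaling described above, the following sketch evaluates a per-channel squared error divided by the spatial RMS of the target channel. It is not a reproduction of Equation 8: dividing by a single power of the RMS, averaging (rather than summing) over grid points and channels, the small `eps` guard, the flat list-of-channels layout, and the names `rms`, `smse`, and `batch_smse` are all assumptions made for illustration.

```python
import math

def rms(field):
    """Spatial root-mean-square of one channel, given as a flat list of grid values."""
    return math.sqrt(sum(v * v for v in field) / len(field))

def smse(pred, target, eps=1e-12):
    """Scaled squared error for one transition (assumed form, not the paper's Equation 8).

    pred, target: lists of channels; each channel is a flat list of grid values.
    Each channel's mean squared error is divided by the RMS of the target channel.
    """
    total = 0.0
    for p_c, t_c in zip(pred, target):
        sq_err = sum((p - t) ** 2 for p, t in zip(p_c, t_c)) / len(t_c)
        total += sq_err / (rms(t_c) + eps)  # eps guards against an all-zero channel
    return total / len(target)

def batch_smse(preds, targets):
    """Average the per-transition cost over a mini-batch."""
    return sum(smse(p, t) for p, t in zip(preds, targets)) / len(targets)
```

Under this assumed form, rescaling both prediction and target by a factor $a$ rescales the cost by $a$, which sits between the $a^{2}$ scaling of ordinary mean squared error and the scale invariance of a fully relative error.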

In order to compare the results of our experiments across model architectures, we trained both a UNet (Ronneberger et al., 2015) and a vision transformer (ViT) (Vaswani et al., 2023; Liu et al., 2021; Dosovitskiy et al., 2021). Our UNet was based on BigGAN (Brock et al., 2019), with four down-sampling blocks and $24$ channels after the initial embedding layer, for a total of about $13$ million parameters. Our vision transformer was of layer-depth six, with $256$ channels, for a total of about $5$ million parameters, fewer than our UNet due to memory constraints on the $40$ GB A100s that we used for training. To supplement our compressible Euler study, we repeated our analysis for velocity fields corresponding to solutions of the Navier-Stokes equations with NS-BB, NS-Gauss, and NS-Sines initial conditions, which present a distinct feature space and flow morphology, involving smoother, viscosity-regularized transport with vorticity-dominated structure [see Figure 11].

The validation losses for each of our CE-tasked models are shown in Figure 9. Despite possessing fewer parameters, our ViT model consistently outperformed the UNet. Each of the two architectures was trained three times, with the same three seeds controlling initialization and dataset split for both architectures. Each model was trained in distributed mode across two A100s.

7 Conclusion

Our research reveals a critical shortcoming common to physics-agnostic PDE emulators: large-scale multi-scenario training does not, by itself, guarantee that the trained model satisfies the stringent expectations that follow from the governing equations it is meant to emulate. Physical expectations demand a shared feature basis across trajectories, yet our results reveal a failure of both UNets and ViTs to support a nontrivial off-diagonal response. This mismatch underscores the importance of enforcing principled physics-based constraints, either as weak regularizers during training or baked in strictly through architectural design. By measuring gradient overlap between classes of initial conditions, we find an absence of coherent gradients, which suggests limited learning of robust, transferable physics. This demonstrates that both ViT and UNet surrogates embed these solution classes on nearly disjoint manifolds, challenging the efficacy of current multi-scenario training pipelines.

We demonstrate that influence functions form a versatile diagnostic framework that is effective in revealing the balance between memorization and generalization in autoregressive predictors. This analysis suggests that ordinary data-driven PDE emulators behave as statistical estimators, producing predictions primarily based on those training examples that lie within a neighborhood of the input query. While this localized learning mechanism provides resilience against noisy data, it also restricts generalization and indicates that the learned data manifold geometry is composed of largely isolated regions.

In summary, we highlight a new, concrete, and targetable characteristic, namely time- and class-aware cross-influence, to guide researchers in designing algorithms capable of learning the underlying generative process and achieving reliable long-term rollouts.

Software and Data

We trained our models on the openly available dataset PDEGym (Herde et al., 2024) using Lux.jl (Pal, 2023a, b), with Zygote.jl as our auto-differentiation backend (Innes, 2018). Plots in this manuscript were generated using Makie.jl (Danisch and Krumbiegel, 2021).

The code used in this work is publicly available at https://github.com/lanl/PDEHats. Additionally, trained models and gradient data are available from the authors upon reasonable request.

Acknowledgements

Research presented in this report was supported by the Laboratory Directed Research and Development program of Los Alamos National Laboratory under project number(s) 20250637DI, 20250638DI, and 20250639DI. This research used resources provided by the Los Alamos National Laboratory Institutional Computing Program, which is supported by the U.S. Department of Energy National Nuclear Security Administration under Contract No. 89233218CNA000001. It is published under LA-UR-25-28084.

Impact Statement

This work contributes a diagnostic framework for distinguishing mere memorization from genuine generalization in autoregressive surrogate models, with particular relevance to scientific and engineering applications where model failures can have downstream consequences. By exposing training-dynamics limitations that are not visible through standard accuracy metrics, our analysis helps identify when learned surrogates are likely to be brittle under long-horizon rollout or distribution shift, informing safer deployment in settings such as climate modeling, fluid dynamics, and materials simulation. The methods introduced here are diagnostic rather than prescriptive and do not directly enable new capabilities for misuse; instead, they promote transparency and reliability by clarifying when and why models fail to internalize shared physical structure. More broadly, our research encourages the development of learning algorithms and evaluation practices that prioritize robustness and interpretability, supporting the responsible use of machine learning in high-consequence scientific workflows.

Appendix A Hat Matrix

To obtain the related expression familiar from classical influence function theory, let $Q=\hat{y}$ and

[TABLE]

where $\delta y$ is a target feature variation. Then

[TABLE]

when $C$ is the mean squared error cost, ${\delta\hat{y}}/{\delta y}=\Pi.$ Equation 9 allows for investigating the influence of both physics-informed and numerical-routine aware data modifications, which we save for future work.
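For orientation, the classical analogue can be sketched for a one-parameter least-squares fit, where the prediction is linear in the targets and the hat matrix is known in closed form. This is a toy illustration rather than the construction used in this work; the model $\hat{y}_{i}=a x_{i}$, the sample values, and the function names are assumptions.

```python
def hat_matrix(xs):
    """Hat matrix H = x x^T / (x^T x) for the no-intercept fit y_hat = a * x."""
    norm_sq = sum(x * x for x in xs)
    return [[xi * xj / norm_sq for xj in xs] for xi in xs]

def fit_predict(xs, ys):
    """Least-squares slope a = (x . y) / (x . x), then predictions a * x."""
    a = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return [a * x for x in xs]

xs = [1.0, 2.0, 3.0]
ys = [1.1, 1.9, 3.2]
H = hat_matrix(xs)

# Perturb a single target and compare the change in predictions with column k of H.
k, delta = 1, 0.5
ys_pert = list(ys)
ys_pert[k] += delta
base = fit_predict(xs, ys)
pert = fit_predict(xs, ys_pert)
response = [(p - b) / delta for p, b in zip(pert, base)]
column_k = [H[i][k] for i in range(len(xs))]
```

Here `response` and `column_k` agree up to round-off, which is the finite-difference statement that the derivative of the predictions with respect to the targets is the hat matrix for a mean squared error cost.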

Appendix B Determination of $\eta^{-1}$

In practice, we include an $\ell_{2}$-type regularizer corresponding to the weight-decay term used in AdamW-driven training (Kingma and Ba, 2017; Loshchilov and Hutter, 2019). Although this penalty is not intrinsic to the model geometry, being defined with respect to the ambient Euclidean coordinates, it remains a useful extrinsic regularizer when viewing the parameter manifold as embedded in a product of real coordinate spaces. Concretely, we consider the regularized metric

[TABLE]

with weight decay $\lambda$ providing mass to the zero modes of $\eta$ , thereby weakly lifting its flat directions.

We apply an iterative matrix-free solver, specifically the CRAIG method (Orban and Arioli, 2017) provided by the Krylov.jl package (Montoison and Orban, 2023), which is formally equivalent to the conjugate gradient method applied to the normal equations, to efficiently approximate the required sensitivities. A direct inversion to determine $\eta$ is not computationally feasible because of the large number of trainable parameters in our models. While this approach does not leverage commonly used scalable approximations (George, 2021; TransferLab, 2024), such approximations do not provide error control. When evaluating our UNet models, we determine the action of $\eta$ with a relative error tolerance of $1.5\times 10^{-2}$. We reach similar absolute error for our ViT models using a relative tolerance of $5\times 10^{-2}$; our ViTs have both fewer parameters and smaller dominant NTK eigenvalues than our UNets [see Figure 8 and Figure 35].
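A minimal sketch of a matrix-free solve with a relative residual tolerance is given below. It uses plain conjugate gradients rather than CRAIG or Krylov.jl, and the operator `apply_metric` (a small diagonal matrix with one zero mode, plus $\lambda$ times the identity) is a hypothetical stand-in for the regularized metric. The point is only the pattern: the operator is touched exclusively through matrix-vector products, and the stopping rule gives explicit error control.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_gradient(apply_op, b, rtol=1.5e-2, max_iter=1000):
    """Solve apply_op(x) = b for a symmetric positive definite operator.

    Iteration stops once the residual norm satisfies ||r|| <= rtol * ||b||.
    """
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x for the initial guess x = 0
    p = list(r)
    rs = dot(r, r)
    b_norm_sq = dot(b, b)
    for _ in range(max_iter):
        if rs <= (rtol ** 2) * b_norm_sq:
            break
        Ap = apply_op(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

LAMBDA = 1e-4  # weight decay, as quoted in the text

def apply_metric(v):
    """Hypothetical stand-in: (G + LAMBDA * I) v with G = diag(2, 1, 0)."""
    g_diag = [2.0, 1.0, 0.0]
    return [(g + LAMBDA) * vi for g, vi in zip(g_diag, v)]

b = [1.0, 1.0, 1.0]
x = conjugate_gradient(apply_metric, b, rtol=1e-8)
```

In this toy problem the third direction is a zero mode of `G`; the weight-decay term makes it invertible, and the solution along that direction is approximately $1/\lambda$.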

Appendix C Compressible Euler

The compressible Euler equations in two spatial dimensions can be expressed in terms of four continuity equations, each of which is of the form

$$\partial_{t}\rho_{c}+\nabla\cdot\mathbf{J}_{c}=0,\tag{12}$$

where $\rho_{c}$ is a conserved density, $\mathbf{J}_{c}$ is the associated conserved current, and $c$ designates mass, momentum, and energy. Equation 12 follows directly from symmetry arguments: invariance under time translation yields energy conservation, spatial translation invariance implies momentum conservation, and an underlying global phase symmetry provides mass conservation. When Equation 12 is defined with periodic boundary conditions, the volume integrals of the conserved densities remain exact invariants for all time. Thus, both the local continuity relations and their associated global constraints must be respected: the domain-integrated mass, the two Cartesian components of momentum, and the total energy may not drift at any time during the rollout. Any surrogate or reduced-order model that aspires to physical fidelity must therefore honour these integral invariants, in addition to satisfying the differential conservation laws of Equation 12.

C.1 Integral Invariants

For definiteness, we show that mass, momentum, and energy are conserved in this system. To this end, recall that

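$$J^{j}_{\text{mass}}=\rho^{j}_{\text{mom}},\qquad J^{ij}_{\text{mom}}=\frac{\rho^{i}_{\text{mom}}\rho^{j}_{\text{mom}}}{\rho_{\text{mass}}}+p\,\delta_{ij},\qquad J^{j}_{\text{energy}}=\left(\rho_{\text{energy}}+p\right)\frac{\rho^{j}_{\text{mom}}}{\rho_{\text{mass}}},$$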

where $p$ is the pressure and $\delta_{ij}$ is the Kronecker delta. Having introduced pressure as a fifth dynamical variable, a constitutive relation is needed in order to arrive at a closed system of equations, which is achieved on writing the energy density as

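$$\rho_{\text{energy}}=\rho_{\text{mass}}\,e+\frac{1}{2\rho_{\text{mass}}}\sum_{i}\left(\rho^{i}_{\text{mom}}\right)^{2},$$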

where the specific internal energy $e$ is related to the pressure through the ideal-gas law

$$p=(\gamma-1)\,\rho_{\text{mass}}\,e,$$

with $\gamma=1.4$ the adiabatic index of a diatomic gas.
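The closure can be made concrete with a short sketch that recovers pressure from the conserved fields at a single grid point. It assumes the standard split of the energy density into internal and kinetic parts, $\rho_{\text{energy}}=\rho_{\text{mass}}e+|\rho_{\text{mom}}|^{2}/(2\rho_{\text{mass}})$, together with $p=(\gamma-1)\rho_{\text{mass}}e$; the function names and argument layout are illustrative and not taken from the released code.

```python
GAMMA = 1.4  # adiabatic index of a diatomic gas

def pressure(rho_mass, rho_mom, rho_energy, gamma=GAMMA):
    """Ideal-gas pressure from conserved densities at one grid point.

    rho_mom is the tuple of Cartesian momentum-density components.
    """
    kinetic = sum(m * m for m in rho_mom) / (2.0 * rho_mass)
    internal = rho_energy - kinetic          # equals rho_mass * e
    return (gamma - 1.0) * internal

def conserved_from_primitive(rho, velocity, p, gamma=GAMMA):
    """Inverse map: (density, velocity, pressure) -> (mass, momentum, energy) densities."""
    rho_mom = tuple(rho * u for u in velocity)
    rho_energy = p / (gamma - 1.0) + 0.5 * rho * sum(u * u for u in velocity)
    return rho, rho_mom, rho_energy
```

A round trip such as `pressure(*conserved_from_primitive(1.2, (0.3, -0.4), 0.9))` returns the input pressure up to round-off, which shows that the velocity and pressure representation and the momentum and energy representation carry the same information.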

While the continuity equations control the pointwise evolution of the densities, global conservation guarantees that the total amount of each conserved quantity is invariant under the flow map. Integrating the continuity equation for any density $\rho_{c}$ over the periodic domain $\Omega$ and applying integration by parts (or, equivalently, the divergence theorem) yields

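$$\frac{\mathrm{d}}{\mathrm{d}t}\int_{\Omega}\rho_{c}\,\mathrm{d}^{2}x=-\int_{\Omega}\nabla\cdot\mathbf{J}_{c}\,\mathrm{d}^{2}x=0.$$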

Therefore,

$$\int_{\Omega}\rho_{c}\,\mathrm{d}^{2}x$$

is an integral of motion. This statement precludes secular drift of mass, momentum, or energy in long simulations, and serves as a primary evaluation metric for surrogate models. Note that the Navier-Stokes equations can also be expressed in the form of Equation 12 and therefore also admit mass, momentum, and energy as conserved variables. However, in order to sensibly train a neural network on both compressible Euler and Navier-Stokes data, one should attach to the model input an indication of whether or not $J_{\text{mom}}$ contains a viscous term.
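The global statement can be checked numerically. The sketch below advances a single conserved density on a periodic one-dimensional grid with a flux-form (finite-volume) update, for which the discrete total is preserved up to round-off, and defines a drift measure that could equally be applied to a surrogate rollout. The linear advection flux, the upwind scheme, the one-dimensional setting, and all names are assumptions chosen for brevity; this is not the solver behind PDEGym.

```python
def step_flux_form(rho, speed=1.0, dt=0.5, dx=1.0):
    """One upwind finite-volume step of d(rho)/dt + d(speed * rho)/dx = 0, periodic in x."""
    n = len(rho)
    flux = [speed * rho[i] for i in range(n)]  # flux leaving cell i to the right (speed > 0)
    # Each cell gains what its left neighbour loses, so the summed update telescopes to zero.
    # Index -1 wraps around, which implements the periodic boundary.
    return [rho[i] - (dt / dx) * (flux[i] - flux[i - 1]) for i in range(n)]

def total(rho, dx=1.0):
    """Discrete domain integral of a density."""
    return sum(rho) * dx

def relative_drift(rollout, dx=1.0):
    """Largest relative deviation of the domain integral from its initial value."""
    q0 = total(rollout[0], dx)
    return max(abs(total(state, dx) - q0) / abs(q0) for state in rollout)

state = [1.0, 2.0, 4.0, 2.0, 1.0, 1.0, 1.0, 1.0]
rollout = [state]
for _ in range(16):
    rollout.append(step_flux_form(rollout[-1]))
```

For this rollout `relative_drift(rollout)` vanishes to round-off even though the profile itself changes at every step. A learned surrogate offers no such guarantee by construction, so the same measure applied to its predicted mass, momentum, and energy fields quantifies the secular drift discussed above.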

Appendix D Supplementary Figures

D.1 Hat Matrix

D.2 Rollout Predictions

Bibliography

1. P.-A. Absil, R. Mahony, and R. Sepulchre (2008) Optimization algorithms on matrix manifolds. Princeton University Press, Princeton, NJ. ISBN 978-0-691-13298-3.
2. Anonymous (2025) Measuring model robustness via Fisher information: spectral bounds, theoretical guarantees, and practical algorithms. Submitted to The Fourteenth International Conference on Learning Representations. Under review.
3. J. Bae, N. Ng, A. Lo, M. Ghassemi, and R. Grosse (2022) If influence functions are the answer, then what is the question? arXiv:2209.05364.
4. I. Batatia, P. Benner, Y. Chiang, A. M. Elena, D. P. Kovács, J. Riebesell, X. R. Advincula, M. Asta, M. Avaylon, W. J. Baldwin, F. Berger, N. Bernstein, A. Bhowmik, S. M. Blau, V. Cărare, J. P. Darby, S. De, F. D. Pia, V. L. Deringer, R. Elijošius, Z. El-Machachi, F. Falcioni, E. Fako, A. C. Ferrari, A. Genreith-Schriever, J. George, R. E. A. Goodall, C. P. Grey, P. Grigorev, S. Han, W. Handley, H. H. Heenen, K. Hermansson, C. Holm, J. Jaafar, S. Hofmann, K. S. Jakob, H. Jung, V. Kapil, A. D. Ka
5. S. Behpour, T. Doan, X. Li, W. He, L. Gou, and L. Ren (2023) GradOrth: a simple yet efficient out-of-distribution detection with orthogonal projection of gradients. arXiv:2308.00310.
6. C. Bodnar, W. P. Bruinsma, A. Lucic, M. Stanley, A. Vaughan, J. Brandstetter, P. Garvan, M. Riechert, J. A. Weyn, H. Dong, J. K. Gupta, K. Thambiratnam, A. T. Archibald, C. Wu, E. Heider, M. Welling, R. E. Turner, and P. Perdikaris (2024) A foundation model for the Earth system. arXiv:2405.13063.
7. J. Brandstetter, D. Worrall, and M. Welling (2023) Message passing neural PDE solvers. arXiv:2202.03376.
8. A. Brock, J. Donahue, and K. Simonyan (2019) Large scale GAN training for high fidelity natural image synthesis. arXiv:1809.11096.
