TL;DR
Overtone introduces cyclic patch modulation for PDE surrogates, reducing harmonic errors and enabling flexible, compute-adaptive inference, achieving up to 40% lower long-term error compared to static patch models.
Contribution
It presents a novel cyclic patch modulation technique with dynamic patch size control, improving accuracy and flexibility of physics emulators.
Findings
Up to 40% reduction in long-rollout error compared to static patch models.
Matches or exceeds fixed-patch baselines across benchmarks.
Enables dynamic trade-offs between accuracy and speed.
Abstract
Transformer-based PDE surrogates achieve remarkable performance but face two key challenges: fixed patch sizes cause systematic error accumulation at harmonic frequencies, and computational costs remain inflexible regardless of problem complexity or available resources. We introduce Overtone, a unified solution through dynamic patch size control at inference. Overtone's key insight is that cyclically modulating patch sizes during autoregressive rollouts distributes errors across the frequency spectrum, mitigating the systematic harmonic artifact accumulation that plagues fixed-patch models. We implement this through two architecture-agnostic modules--CSM (using dynamic stride modulation) and CKM (using dynamic kernel resizing)--that together provide both harmonic mitigation and compute-adaptive deployment. This flexible tokenization lets users trade accuracy for speed dynamically based…
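The cyclic schedule described in the abstract can be pictured with a minimal sketch. Everything here is an illustrative assumption, not the paper's implementation: the function names (`cyclic_patch_schedule`, `token_count`, `rollout`), the patch sizes (4, 8, 16), the 256 x 256 grid, and the `step_fn` stand-in for the surrogate model.

```python
from itertools import cycle, islice


def cyclic_patch_schedule(patch_sizes, num_steps):
    """Return the patch size used at each rollout step, cycling through patch_sizes."""
    return list(islice(cycle(patch_sizes), num_steps))


def token_count(height, width, patch):
    """Number of non-overlapping patch tokens for a height x width grid."""
    if height % patch or width % patch:
        raise ValueError("grid must be divisible by the patch size")
    return (height // patch) * (width // patch)


def rollout(state, step_fn, patch_sizes, num_steps):
    """Autoregressive rollout; step_fn(state, patch) is a hypothetical stand-in
    for one surrogate forward pass at the given patch size."""
    trajectory = [state]
    for patch in cyclic_patch_schedule(patch_sizes, num_steps):
        state = step_fn(state, patch)
        trajectory.append(state)
    return trajectory


# Assumed patch sizes and grid, chosen only for illustration.
schedule = cyclic_patch_schedule((4, 8, 16), 7)
tokens = [token_count(256, 256, p) for p in schedule]
```

In this sketch the patch boundaries change from step to step instead of staying on one fixed grid, and the token count per step (a rough proxy for attention cost) varies with the patch size, which is the accuracy-for-speed trade-off the abstract refers to.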
Peer Reviews
Decision: ICLR 2026 Poster
- The authors analyzed the phenomenon in Transformer-based PDE prediction where fixed patch grids lead to error accumulation at specific harmonic frequencies, which manifests as spatial artifacts. To address this, they adopt strategies common in computer vision, incorporating multi-size kernels and multi-strides within a single model, thereby mitigating this problem in ViTs and achieving modest performance improvements.
- The authors conduct evaluations on multiple PDE datasets from *The Well*
- Insufficient Baselines: The paper claims flexibility but lacks critical comparisons against existing ViT variants that also process multi-scale information (e.g., U-ViT or Swin-Unet), making Overtone's relative advantages unclear. The authors are expected to add direct performance comparisons against these relevant baselines in their revision.
- Limited Novelty: The core modules (CKM and CSM) are not novel. CKM is heavily based on prior work such as FlexiViT, and CSM (stride modulation) is a
* The paper's core strength lies in identifying and articulating the problem of harmonic error accumulation due to temporal coherence. This is a subtle but important issue that has been overlooked. The proposed cyclic modulation strategy is a simple and effective solution that directly addresses this root cause, and the empirical results demonstrate its success.
* The authors conduct a thorough and rigorous experimental evaluation. The use of diverse and challenging 2D/3D datasets from The
While the method is empirically successful, there are several areas of concern that detract from the paper's overall quality.
* The theoretical analysis in Section 2 and Appendix A provides some solid insights, but I don't quite understand how changing the patch grid temporally thins and phase-misaligns injections, which then shifts a portion of the error growth from quadratic to linear (lines 847-852).
* A major concern is the physical intuition behind cyclically modulating the input representati
- The authors notice and theoretically analyze an interesting problem, that is, systematic harmonic artifacts in the rollout of fixed patch vision transformers.
- The proposed method is verified to be effective in long-term rollout.
- Sufficient implementation details are included.
Despite the above strengths, I think this paper has some implementation errors, which may render the experimental results meaningless.
### (1) Potential implementation error.
In the canonical vision transformer (ViT) and its follow-ups, such as Swin Transformer, patch embedding is implemented as a flattening and linear projection. In contrast, this paper adopts convolutional layers for patch embedding and decoding. I do not think this is a correct implementation. Thus, I think the proposed m
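For context on the two patch-embedding formulations contrasted in this review, the sketch below compares them numerically on a toy input. It is an illustration under stated assumptions, not the paper's code: a single input channel, a convolution whose stride equals its kernel size (so patches do not overlap), and hypothetical helper names `conv_patch_embed` and `linear_patch_embed`. The excerpt does not say whether Overtone's layers meet the non-overlap condition, and stride modulation may break it.

```python
def conv_patch_embed(image, kernels, patch):
    """Strided 2D convolution with stride == kernel size == patch (assumed),
    one input channel, one output channel per kernel."""
    rows, cols = len(image) // patch, len(image[0]) // patch
    tokens = []
    for i in range(rows):
        for j in range(cols):
            tokens.append([
                sum(kernel[a][b] * image[i * patch + a][j * patch + b]
                    for a in range(patch) for b in range(patch))
                for kernel in kernels
            ])
    return tokens


def linear_patch_embed(image, weight, patch):
    """Flatten each non-overlapping patch row-major, then apply a linear map."""
    rows, cols = len(image) // patch, len(image[0]) // patch
    tokens = []
    for i in range(rows):
        for j in range(cols):
            flat = [image[i * patch + a][j * patch + b]
                    for a in range(patch) for b in range(patch)]
            tokens.append([sum(w * x for w, x in zip(row, flat)) for row in weight])
    return tokens


# Toy 4 x 4 image and two arbitrary 2 x 2 kernels (illustrative values only).
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
kernels = [[[1.0, 0.0], [0.0, -1.0]], [[0.5, 0.5], [0.5, 0.5]]]
# The linear weight matrix is each kernel flattened row-major.
weight = [[v for row in k for v in row] for k in kernels]

conv_tokens = conv_patch_embed(image, kernels, 2)
linear_tokens = linear_patch_embed(image, weight, 2)
```

Under the non-overlap assumption the two functions produce identical tokens, because each output is the same weighted sum over one patch. With a stride smaller than the kernel size the patches overlap, and the convolution no longer reduces to flatten-and-project over disjoint patches.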
