Efficient Generative Transformer Operators For Million-Point PDEs
Armand Kassaï Koupaï, Lise Le Boudec, Patrick Gallinari

TL;DR
ECHO is a novel transformer-based framework that efficiently generates high-resolution PDE trajectories on dense grids, overcoming scalability and error accumulation issues of previous neural operators.
Contribution
The paper introduces ECHO, a hierarchical convolutional transformer architecture with a new training strategy for scalable, high-fidelity PDE trajectory generation from sparse inputs.
Findings
Achieves 100x spatio-temporal compression while maintaining fidelity.
Enables high-resolution PDE solutions from sparse input data.
Demonstrates state-of-the-art results on complex, long-horizon PDE simulations.
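To give a sense of scale for the 100× figure, the sketch below counts grid points before and after compression. The trajectory shape and the per-axis strides are illustrative assumptions chosen so that the product of the strides equals 100; they are not values taken from the paper, and the helper names are hypothetical.

```python
from math import prod


def latent_shape(shape, strides):
    """Shape of the latent grid after downsampling each axis by its stride."""
    # Ceil division so that axes not divisible by the stride are still covered.
    return tuple(-(-n // s) for n, s in zip(shape, strides))


def compression_ratio(shape, strides):
    """Number of input points divided by number of latent positions."""
    return prod(shape) / prod(latent_shape(shape, strides))


# Hypothetical trajectory: 16 time steps on a 250 x 250 grid = 1,000,000 points.
trajectory_shape = (16, 250, 250)
# Hypothetical strides: 4x in time, 5x along each spatial axis (4 * 5 * 5 = 100).
strides = (4, 5, 5)

print(latent_shape(trajectory_shape, strides))       # (4, 50, 50)
print(compression_ratio(trajectory_shape, strides))  # 100.0
```

This counts positions only. It ignores the channel width of the latent representation, so it should be read as a rough token-count estimate rather than a measure of stored information.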
Abstract
We introduce ECHO, a transformer-operator framework for generating million-point PDE trajectories. While existing neural operators (NOs) have shown promise for solving partial differential equations, they remain limited in practice due to poor scalability on dense grids, error accumulation during dynamic unrolling, and task-specific design. ECHO addresses these challenges through three key innovations. (i) It employs a hierarchical convolutional encode-decode architecture that achieves a 100× spatio-temporal compression while preserving fidelity on mesh points. (ii) It incorporates a training and adaptation strategy that enables high-resolution PDE solution generation from sparse input grids. (iii) It adopts a generative modeling paradigm that learns complete trajectory segments, mitigating long-horizon error drift. The training strategy decouples representation learning from…
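One way to picture point (iii) is to count how many sequential model calls a rollout needs. The sketch below is a toy illustration, not the paper's analysis: the horizon, the segment length, the 1% per-call error, and the multiplicative error model are all assumptions, and the function names are hypothetical.

```python
from math import ceil


def num_sequential_calls(horizon, segment_len):
    """Sequential generation steps needed to cover `horizon` frames."""
    return ceil(horizon / segment_len)


def compounded_error(per_call_error, calls):
    """Toy assumption: each call multiplies the relative error by (1 + per_call_error)."""
    return (1.0 + per_call_error) ** calls - 1.0


horizon = 64  # hypothetical number of frames to predict
for segment_len in (1, 16):  # 1 = frame-by-frame unrolling, 16 = segment generation
    calls = num_sequential_calls(horizon, segment_len)
    print(segment_len, calls, round(compounded_error(0.01, calls), 3))
```

Under these assumptions, generating 16-frame segments needs 4 sequential calls instead of 64, so the error has fewer chances to compound. In practice the per-call error of a segment generator need not match that of a single-step model, so the comparison is only qualitative.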
Peer Reviews
Decision: Submitted to ICLR 2026
1. It employs a compression-decompression architecture, supporting the solution of PDEs with millions of points.
2. The designed encoder can map irregular grids in physical space to regular grids in latent space.
3. Using a DiT-based generative approach, the trained model can simultaneously support multiple tasks such as forward solving, inverse solving, and interpolation.
4. By generating complete trajectory segments, it effectively mitigates long-term error accumulation.
1. One of the contributions of the paper is its support for predictions with millions of spatial points. However, the compression process of spatial points in the encoder and decoder is not described in detail. Appendix C.2 mentions that the authors adopted structures similar to (Hagnberger et al., 2025), (Yu et al., 2023), and (Koupaï et al., 2025). So, what are the innovative aspects of the authors' work in spatiotemporal compression and decompression that distinguish it from these references?
1. The paper is clearly written, which makes the method and implementation easy to follow.
2. The method achieves consistently lower MSE than other models across diverse PDE benchmarks, including relatively large-scale problems.
3. The paper indicates that hierarchical convolutional encoding yields lower reconstruction error at a fixed latent size than non-convolutional alternatives (graph, INR, and transformer-AE).
1. While the system is well-executed, many components are established (conv encoder/decoder, regular latent grid, DiT-style transformer, flow matching/latent diffusion). Related efforts already explore latent generative PDE solvers, e.g., [1] conv AE + structured latent grid with flow-matching DiT, [2] latent diffusion generating full trajectories at once, [3] autoregressive latent video diffusion. A clearer exposition of the main differences would strengthen the paper’s contribution. [1] Li, Z
1. The chosen problem scope is broad, showing potential for ECHO to serve as a foundation model for PDE solving.
2. Extensive experiments are conducted, which appear quite thorough. The experiments at 1024-resolution are particularly impressive.
3. The two-stage encoding design is well-motivated, as standard VAEs fail to compress long trajectories at high resolution effectively.
1. Regarding the latent grid mapping: how does the use of continuous convolution differ from constructing a nearest-neighbor graph and applying a GNN? A clarification or comparison would strengthen the methodological justification.
2. Is the assumption of a uniform latent grid always reasonable? For example, if input points are distributed on a 3D spherical surface, a uniform grid would significantly increase the number of points to process. How does the method avoid unnecessary computations for
Taxonomy
Topics: Model Reduction and Neural Networks · Generative Adversarial Networks and Image Synthesis · Tensor decomposition and applications
