CViT: Continuous Vision Transformer for Operator Learning
Sifan Wang, Jacob H. Seidman, Shyam Sankaran, Hanwen Wang, George J. Pappas, Paris Perdikaris

TL;DR
CViT introduces a novel neural operator architecture combining vision transformers with grid-based embeddings and cross-attention, enabling flexible, high-performance learning of complex physical systems across various PDEs.
Contribution
The paper presents CViT, a new neural operator architecture that adapts vision transformer techniques for operator learning in physical sciences, demonstrating superior performance without extensive pretraining.
Findings
Achieves state-of-the-art results on multiple PDE benchmarks.
Effectively handles discontinuities and multi-scale features.
Outperforms larger models with less pretraining.
Abstract
Operator learning, which aims to approximate maps between infinite-dimensional function spaces, is an important area in scientific machine learning with applications across various physical domains. Here we introduce the Continuous Vision Transformer (CViT), a novel neural operator architecture that leverages advances in computer vision to address challenges in learning complex physical systems. CViT combines a vision transformer encoder, a novel grid-based coordinate embedding, and a query-wise cross-attention mechanism to effectively capture multi-scale dependencies. This design allows for flexible output representations and consistent evaluation at arbitrary resolutions. We demonstrate CViT's effectiveness across a diverse range of partial differential equation (PDE) systems, including fluid dynamics, climate modeling, and reaction-diffusion processes. Our comprehensive experiments…
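The abstract's query path (embed a continuous coordinate from a grid, then cross-attend to the encoder tokens) can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: the interpolation weights are taken to be a softmax over negative squared distances scaled by $\beta$, keys and values are the raw tokens with no learned projections, and the grid features and encoder tokens are random stand-ins. The names `grid_coordinate_embedding` and `cross_attention` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grid_coordinate_embedding(query, grid_coords, grid_feats, beta):
    """Embed a continuous query point by interpolating latent grid features.

    ASSUMPTION: weights are a softmax over -beta * squared distance to each
    grid node; the paper's exact weighting may differ.
    """
    logits = [
        -beta * sum((q - g) ** 2 for q, g in zip(query, node))
        for node in grid_coords
    ]
    weights = softmax(logits)
    dim = len(grid_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, grid_feats)) for d in range(dim)]


def cross_attention(query_vec, tokens):
    """Single-head cross-attention for one query over encoder tokens.

    ASSUMPTION: keys and values are the raw tokens (no learned projections).
    """
    scale = 1.0 / math.sqrt(len(query_vec))
    scores = [scale * sum(q * t for q, t in zip(query_vec, tok)) for tok in tokens]
    attn = softmax(scores)
    return [sum(a * tok[d] for a, tok in zip(attn, tokens)) for d in range(len(query_vec))]


rng = random.Random(0)
DIM, SIDE = 8, 4
# Hypothetical 4x4 latent grid on the unit square with random (untrained) features.
grid_coords = [(i / (SIDE - 1), j / (SIDE - 1)) for i in range(SIDE) for j in range(SIDE)]
grid_feats = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in grid_coords]
# Stand-in for the ViT encoder output: 5 tokens of width DIM.
tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]

# Any point in the domain can be queried, not only grid nodes.
query_embedding = grid_coordinate_embedding((0.37, 0.81), grid_coords, grid_feats, beta=50.0)
output = cross_attention(query_embedding, tokens)
print(len(output))
```

In this sketch, a large `beta` makes the embedding approach the feature of the nearest grid node, while `beta = 0` averages all grid features. Because the query coordinate is continuous, the same encoder tokens can be decoded at any output resolution.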
Peer Reviews
Decision: ICLR 2025 Poster
- The proposed coordinate embedding for the query is interesting, and is effective in the continuous ViT learning, compared to MLP and RFF. Also, it is simple and easy to control through the interpolation parameter $\beta$.
- Discussions and analysis in the paper are extensive and interesting. Overall, the paper is well-organized and easy to follow.
- The proposed coordinate embedding could have the potential to apply to more general continuous learning beyond physical domains. Especially for th
- Why not do coordinate embedding for all of query, key, and value? In which way is doing coordinate embedding only for the query preferable? Maybe some further analysis on this could be included to further highlight the effectiveness of the proposed coordinate embedding.
- The first question leads to the second one: in the paper (appendix), the authors discussed the Lipschitz constant for different embeddings from linear embedding to random Fourier features (RFF), and to the proposed coordinate embedd
This paper is well-written, easy to read, and technically sound. The proposed approach appears to scale more efficiently than the current baseline thanks to the perceiver-inspired architecture. The results seem also promising.
* No clear structure of the related work section. It would help the reader to add more justifications on why transformer approaches struggle with various resolutions. Even if I tend to agree with the conclusion, it is also not clear, why a novel architecture design is required to solve the limitations related to high dimensional data and long-range dependencies.
* The transformer architecture in Fig. 1. does not give any insight and is well known. IMO, Authors could either remove it or replace
I found this paper very interesting. Originality: I think this is a very good variation of Vision Transformers which significantly expands their capabilities. I found the interpolated readout mechanism of particular novelty. Quality: This is a well executed paper. There is ample experimental validation using several different datasets. The baselines chosen (as far as I can tell) are diverse and strong. There is a good set of ablation experiments and deep analysis of the method in the appendix.
All in all I think this is a strong paper, however:
* Other domains? I would have loved to see if the resulting method is applicable to other domain or training setups beyond L2 prediction of physical systems. Does it work, for example, on natural video? Would it be useful as a diffusion model back-bone?
* More analysis of cross attention - one of the more interesting parts is the use of cross attention, both in the encoding and decoding. It would be nice to see visualizations of cross attenti
Taxonomy
Topics: Neural Networks and Applications
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Label Smoothing · Adam · Absolute Position Encodings · Dropout · Softmax · Balanced Selection
