TL;DR
PhysiX is a large-scale foundation model for physics simulation that leverages discrete tokenization and autoregressive modeling to overcome data scarcity and improve long-range prediction across diverse physical tasks.
Contribution
This paper introduces PhysiX, the first large-scale physics foundation model with 4.5B parameters, employing a novel discrete tokenizer and refinement module to enhance physics simulation capabilities.
Findings
Outperforms task-specific baselines on The Well benchmark
Effectively transfers knowledge from natural videos to physics simulation
Joint training across tasks enables synergistic learning
Abstract
Foundation models have achieved remarkable success across video, image, and language domains. By scaling up the number of parameters and training datasets, these models acquire generalizable world knowledge and often surpass task-specific approaches. However, such progress has yet to extend to the domain of physics simulation. A primary bottleneck is data scarcity: while millions of images, videos, and textual resources are readily available on the internet, the largest physics simulation datasets contain only tens of thousands of samples. This data limitation hinders the use of large models, as overfitting becomes a major concern. As a result, physics applications typically rely on small models, which struggle with long-range prediction due to limited context understanding. Additionally, unlike images, videos, or text (which typically exhibit fixed granularity), physics datasets often…
Peer Reviews
Decision·Submitted to ICLR 2026
1. The work addresses an important research topic. Developing a foundation model that generalizes across physical systems would have a great impact on the SciML community.
2. The technical designs are reasonable; for example, temporal causality is enforced through causal padding.
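The causal padding mentioned in this review can be illustrated with a minimal sketch. It assumes a plain 1D convolution over the time axis; the review does not say which layers of PhysiX use causal padding, and the function name `causal_conv1d` and the kernel values are hypothetical, chosen only for illustration. The idea is to pad only the past side of the sequence, so the output at frame t never depends on frames after t.

```python
def causal_conv1d(frames, kernel):
    """Temporal convolution in which output[t] depends only on frames[:t + 1].

    Hypothetical helper for illustration; not taken from the paper.
    """
    k = len(kernel)
    # Pad k - 1 zeros on the past side only; nothing is padded on the future side.
    padded = [0.0] * (k - 1) + list(frames)
    # kernel[-1] weights the current frame, kernel[0] the oldest frame in the window.
    return [
        sum(w * v for w, v in zip(kernel, padded[t:t + k]))
        for t in range(len(frames))
    ]


frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [0.25, 0.25, 0.5]
outputs = causal_conv1d(frames, kernel)

# Perturbing frame 4 must leave outputs for frames 0..3 unchanged.
perturbed = list(frames)
perturbed[4] = 100.0
outputs_perturbed = causal_conv1d(perturbed, kernel)
```

With symmetric ("same") padding, the output at frame 3 would also read frame 4, so the perturbation check above would fail; one-sided padding is what makes the layer usable inside an autoregressive model.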
1. The primary concern is that the contribution of the work appears limited. Algorithmically, it combines established components rather than introducing fundamentally novel ideas. In terms of pretraining generalizability, the model is trained on only eight 2D physical systems from The Well. Similar efforts, such as MPP and Poseidon, already exist.
2. Key model architecture hyperparameters are not reported, which hinders reproducibility. Details regarding the training and testing proced
* The proposed method handles diverse simulation tasks in a unified framework.
* It outperforms the baseline methods in most cases.
1. Missing long-horizon visualized results. The supplementary material visualizes at most 24 rollout steps, whereas Table 3 reports predictions up to 56 frames. Please provide additional qualitative results for long-term predictions (e.g., full-trajectory videos or densely sampled frames) to assess temporal faithfulness and smoothness. Quantitative metrics for long-horizon predictions alone are insufficient to evaluate dynamic fidelity.
2. Clarify “long-horizon” scope and report full-length roll
The experiments are fairly complete, considering next-frame predictions, long-timeframe predictions, single and multi-task models, the effect of refinement, scaling across different model sizes, and generalization on unseen tasks. The results on the Well are quite strong, improving not only against previous approaches, but continuing to show stronger evidence of generalizability across tasks/PDEs.
At times, the discussion of the Refinement Module makes it sound like a super-resolution task, going from a coarse prediction to a higher-resolution one. This seems true at least in terms of precision, but Figure 2 does not give the impression that it predicts a longer token sequence than the autoregressive model did. Compared with the decomposed attention in MPP, other studies have previously suggested advantages to using a unified autoregressive transformer versus decomposing
