On the Value of Tokeniser Pretraining in Physics Foundation Models
Hadi Sotoudeh, Payel Mukhopadhyay, Ruben Ohana, Michael McCabe, Neil D. Lawrence, Shirley Ho, Miles Cranmer

TL;DR
Pretraining tokenisers with an autoencoding objective improves the efficiency and accuracy of physics foundation models, most of all when the pretraining data matches the target domain. The paper also introduces adaptable compression operations for diverse tasks.
Contribution
This paper systematically investigates the benefits of tokeniser pretraining in physics models and introduces flexible compression operations for better task adaptation.
Findings
In-domain tokeniser pretraining reduces VRMSE (variance-scaled root mean squared error) by 64%.
Pretraining the tokeniser on physical systems other than the target yields moderate gains.
Pretraining enhances computational efficiency in physics emulation.
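For readers unfamiliar with the metric in the first finding, here is a minimal sketch of VRMSE. It assumes the commonly used definition (root of the mean squared error divided by the variance of the target field) and is not taken from the paper. The function name `vrmse` and the stabiliser `eps` are illustrative choices.

```python
import math
from typing import Sequence


def vrmse(pred: Sequence[float], target: Sequence[float], eps: float = 1e-7) -> float:
    """Variance-scaled RMSE over a flattened field (assumed definition)."""
    if len(pred) != len(target) or len(target) == 0:
        raise ValueError("pred and target must be non-empty and the same length")
    n = len(target)
    mean_target = sum(target) / n
    # Mean squared error between prediction and target.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / n
    # Variance of the target field; eps guards against constant fields.
    var = sum((t - mean_target) ** 2 for t in target) / n
    return math.sqrt(mse / (var + eps))


# A perfect prediction scores 0; predicting the target mean everywhere
# scores close to 1, which makes the metric comparable across fields
# with very different magnitudes.
truth = [0.0, 1.0, 2.0, 3.0]
perfect = vrmse(truth, truth)
mean_only = vrmse([1.5] * 4, truth)
```

Under this definition a value below 1 means the emulator beats a constant mean prediction, so a 64% reduction is a relative change in that normalised error.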
Abstract
We investigate the impact of tokeniser pretraining on the accuracy and efficiency of physics emulation. Modern high-resolution simulations produce vast volumes of data spanning diverse physical regimes and scales. Training foundation models to learn the dynamics underlying such data enables the modelling of complex multiphysics phenomena, especially in data-limited settings. The emerging class of physics foundation models typically aims to learn two tasks jointly: (i) extracting compact representations of high-resolution spatiotemporal data, and (ii) capturing governing physical dynamics. However, learning both tasks from scratch simultaneously can impede the effectiveness of either process. We show that pretraining the tokeniser with an autoencoding objective prior to training the dynamics model enhances computational efficiency for physics emulation. Notably, the magnitude of this…
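The two-stage recipe described in the abstract can be sketched in PyTorch as follows. This is a toy illustration under stated assumptions, not the authors' architecture: the linear `encoder`, `decoder`, and `dynamics` modules, the synthetic travelling-wave data, the step counts, and the choice to fine-tune the tokeniser jointly in stage 2 are all hypothetical.

```python
import math

import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(0)

# Hypothetical toy data: 64 travelling sine waves, 8 time steps, 32 grid points.
x = torch.arange(32) / 32
t = torch.arange(8) / 8
phase = 2 * math.pi * torch.rand(64)
fields = torch.sin(
    2 * math.pi * (x[None, None, :] - t[None, :, None]) + phase[:, None, None]
)

encoder = nn.Linear(32, 4)   # tokeniser encoder: 32 grid values -> 4 latents
decoder = nn.Linear(4, 32)   # tokeniser decoder: 4 latents -> 32 grid values
dynamics = nn.Linear(4, 4)   # stand-in dynamics model acting on latents

tokeniser_params = list(encoder.parameters()) + list(decoder.parameters())


def reconstruction_loss() -> torch.Tensor:
    """Autoencoding objective: reconstruct each snapshot from its latents."""
    return F.mse_loss(decoder(encoder(fields)), fields)


initial_loss = reconstruction_loss().item()

# Stage 1: pretrain the tokeniser alone with the autoencoding objective.
opt = torch.optim.Adam(tokeniser_params, lr=1e-2)
for _ in range(200):
    loss = reconstruction_loss()
    opt.zero_grad()
    loss.backward()
    opt.step()
recon_loss = loss.item()

# Stage 2: train the dynamics model on next-step prediction, starting from
# the pretrained tokeniser (fine-tuned jointly here, which is an assumption).
opt = torch.optim.Adam(tokeniser_params + list(dynamics.parameters()), lr=1e-3)
for _ in range(200):
    pred_next = decoder(dynamics(encoder(fields[:, :-1])))
    loss = F.mse_loss(pred_next, fields[:, 1:])
    opt.zero_grad()
    loss.backward()
    opt.step()
rollout_loss = loss.item()
```

The from-scratch baseline the abstract contrasts against would skip stage 1 and run only the stage 2 loop from random initialisation, so both tasks are learned at once.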
Taxonomy
Topics: Model Reduction and Neural Networks · Generative Adversarial Networks and Image Synthesis · Parallel Computing and Optimization Techniques
