On the Value of Tokeniser Pretraining in Physics Foundation Models

Hadi Sotoudeh; Payel Mukhopadhyay; Ruben Ohana; Michael McCabe; Neil D. Lawrence; Shirley Ho; Miles Cranmer

arXiv:2603.05598 · cs.LG · March 13, 2026

TL;DR

Pretraining the tokeniser with an autoencoding objective significantly improves the efficiency and accuracy of physics foundation models, especially when the pretraining data aligns with the target domain. The paper also introduces adaptable compression techniques for diverse tasks.

Contribution

This paper systematically investigates the benefits of tokeniser pretraining in physics foundation models and introduces flexible compression operations for better task adaptation.

Findings

01 · In-domain tokeniser pretraining reduces VRMSE by 64%.

02 · Pretraining on different systems yields moderate gains.

03 · Pretraining enhances computational efficiency in physics emulation.
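The first finding is stated as a relative reduction in VRMSE. This page does not expand the acronym, so the sketch below assumes the common reading: variance-scaled root mean squared error, meaning the mean squared error divided by the variance of the target field, then square-rooted. The function names and the `eps` stabiliser are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def vrmse(pred, target, eps=1e-7):
    """Assumed definition: sqrt(MSE / (variance of target + eps)).

    A value near 1 means the prediction is no better than always
    predicting the target's mean; 0 means a perfect prediction.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = np.mean((pred - target) ** 2)
    var = np.mean((target - target.mean()) ** 2)
    return float(np.sqrt(mse / (var + eps)))

def relative_reduction(baseline, improved):
    """Fractional drop of an error metric relative to a baseline."""
    return (baseline - improved) / baseline
```

Under this reading, a 64% reduction means the pretrained-tokeniser model's VRMSE is 0.36 times that of the baseline it is compared against, so `relative_reduction(baseline, improved)` would return about 0.64.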

Abstract

We investigate the impact of tokeniser pretraining on the accuracy and efficiency of physics emulation. Modern high-resolution simulations produce vast volumes of data spanning diverse physical regimes and scales. Training foundation models to learn the dynamics underlying such data enables the modelling of complex multiphysics phenomena, especially in data-limited settings. The emerging class of physics foundation models typically aims to learn two tasks jointly: (i) extracting compact representations of high-resolution spatiotemporal data, and (ii) capturing governing physical dynamics. However, learning both tasks from scratch simultaneously can impede the effectiveness of either process. We show that pretraining the tokeniser with an autoencoding objective prior to training the dynamics model enhances computational efficiency for physics emulation. Notably, the magnitude of this…
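The two-stage recipe described in the abstract (fit the tokeniser on reconstruction first, then train the dynamics model) can be illustrated with a deliberately small toy. Everything here is an assumption made for illustration: the data is a synthetic linear system, the tokeniser is a linear autoencoder fit in closed form by SVD (the minimiser of reconstruction error for a linear encoder and decoder) rather than a neural network trained by gradient descent, and the tokeniser is held frozen in stage 2, which the abstract does not specify. All function names are hypothetical.

```python
import numpy as np

def make_trajectory(steps=200, dim=32, latent=4, seed=0):
    """Toy data: a rotating latent state observed through a fixed linear map."""
    rng = np.random.default_rng(seed)
    basis, _ = np.linalg.qr(rng.normal(size=(dim, latent)))  # orthonormal columns
    theta = 0.1
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta), np.cos(theta)]])
    step = np.kron(np.eye(latent // 2), rot)  # block-diagonal rotation (latent must be even)
    z = rng.normal(size=latent)
    states = []
    for _ in range(steps):
        states.append(basis @ z)
        z = step @ z
    return np.stack(states)  # shape: steps x dim

def pretrain_tokeniser(x, latent):
    """Stage 1: fit encoder and decoder on reconstruction alone."""
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    enc = vt[:latent].T  # dim x latent
    dec = vt[:latent]    # latent x dim
    return enc, dec

def train_dynamics(x, enc):
    """Stage 2: with the tokeniser frozen, fit next-step latent dynamics."""
    z = x @ enc  # steps x latent
    m, *_ = np.linalg.lstsq(z[:-1], z[1:], rcond=None)
    return m     # latent x latent

def predict_next(x_t, enc, dec, m):
    """Encode, advance one step in latent space, decode."""
    return (x_t @ enc) @ m @ dec

x = make_trajectory()
enc, dec = pretrain_tokeniser(x, latent=4)
m = train_dynamics(x, enc)
recon_err = np.abs(x @ enc @ dec - x).max()
step_err = np.abs(predict_next(x[:-1], enc, dec, m) - x[1:]).max()
print(f"max reconstruction error: {recon_err:.2e}")
print(f"max one-step error: {step_err:.2e}")
```

The point of the toy is the separation of concerns: stage 2 only has to learn a small latent-to-latent map because stage 1 has already settled the compression. It says nothing about the size of the gains reported in the paper.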

Taxonomy

Topics: Model Reduction and Neural Networks · Generative Adversarial Networks and Image Synthesis · Parallel Computing and Optimization Techniques