TL;DR
This paper investigates how tokenization and discrete representations in scientific foundation models hinder the preservation of continuous geometric structure. It traces the failure to discrete categorical bottlenecks and characterizes the model limitations that result.
Contribution
It introduces the Geometric Alignment Tax, the intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks, and provides empirical evidence of how architecture and training objective affect geometric fidelity.
Findings
Replacing cross-entropy with a continuous head on an identical encoder reduces geometric distortion by up to 8.5x.
In learned codebooks, finer quantization can worsen geometry despite improving reconstruction.
No model simultaneously achieves low distortion, high mutual information, and global coherence.
Abstract
Foundation models for biology and physics optimize predictive accuracy, but their internal representations systematically fail to preserve the continuous geometry of the systems they model. We identify the root cause: the Geometric Alignment Tax, an intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks. Controlled ablations on synthetic dynamical systems demonstrate that replacing cross-entropy with a continuous head on an identical encoder reduces geometric distortion by up to 8.5x, while learned codebooks exhibit a non-monotonic double bind where finer quantization worsens geometry despite improving reconstruction. Under continuous objectives, three architectures differ by 1.3x; under discrete tokenization, they diverge by 3,000x. Evaluating 14 biological foundation models with rate-distortion theory and MINE, we identify three failure regimes:…
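The intuition behind a discrete categorical bottleneck can be illustrated with a toy sketch. It is not the paper's experimental setup or its distortion metric. The assumptions are: the continuous manifold is a unit circle, the discrete representation is a one-hot vector over K equal angle bins, the continuous representation is the (cos, sin) coordinate pair, and geometric fidelity is measured as the Pearson correlation between true geodesic distances and distances in representation space. The helper name distance_agreement and all parameter values are hypothetical.

```python
import numpy as np


def pairwise_dist(x):
    """Euclidean distance between every pair of rows in x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def geodesic_circle(theta):
    """Arc-length distance between every pair of angles on the unit circle."""
    d = np.abs(theta[:, None] - theta[None, :])
    return np.minimum(d, 2 * np.pi - d)


def distance_agreement(true_d, rep_d):
    """Pearson correlation over the upper triangle of two distance matrices.

    Assumption: this is an illustrative fidelity score, not the paper's metric.
    """
    iu = np.triu_indices_from(true_d, k=1)
    return float(np.corrcoef(true_d[iu], rep_d[iu])[0, 1])


rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2 * np.pi, size=200)  # points on the manifold
true_d = geodesic_circle(theta)

# Continuous representation: coordinates on the circle.
continuous = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# Discrete representation: bin the angle into K tokens, then one-hot encode.
K = 16  # hypothetical vocabulary size
tokens = np.clip(np.floor(theta / (2 * np.pi / K)).astype(int), 0, K - 1)
one_hot = np.eye(K)[tokens]

r_cont = distance_agreement(true_d, pairwise_dist(continuous))
r_disc = distance_agreement(true_d, pairwise_dist(one_hot))

print(f"continuous representation: r = {r_cont:.3f}")
print(f"one-hot token representation: r = {r_disc:.3f}")
```

In this toy setting the one-hot distances take only two values (zero for a shared token, a constant otherwise), so they carry little information about how far apart two points are on the circle. A learned embedding over tokens could recover part of that structure, so the sketch shows only the bottleneck in its starkest form.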
