The Geometric Alignment Tax: Tokenization vs. Continuous Geometry in Scientific Foundation Models

Prashant C. Raju

arXiv:2604.04155·cs.LG·April 7, 2026

TL;DR

This paper investigates how tokenization and other discrete representations in scientific foundation models degrade the continuous geometric structure of the systems they model. It traces the degradation to a single root cause: the cost of forcing continuous manifolds through discrete categorical bottlenecks.

Contribution

The paper introduces the Geometric Alignment Tax, an intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks, and provides empirical evidence of how architecture and training objective affect geometric fidelity.

Findings

01. Replacing cross-entropy with a continuous head reduces geometric distortion by up to 8.5x.

02. Finer quantization can worsen geometry despite better reconstruction.

03. No model simultaneously achieves low distortion, high mutual information, and global coherence.

Abstract

Foundation models for biology and physics optimize predictive accuracy, but their internal representations systematically fail to preserve the continuous geometry of the systems they model. We identify the root cause: the Geometric Alignment Tax, an intrinsic cost of forcing continuous manifolds through discrete categorical bottlenecks. Controlled ablations on synthetic dynamical systems demonstrate that replacing cross-entropy with a continuous head on an identical encoder reduces geometric distortion by up to 8.5x, while learned codebooks exhibit a non-monotonic double bind where finer quantization worsens geometry despite improving reconstruction. Under continuous objectives, three architectures differ by 1.3x; under discrete tokenization, they diverge by 3,000x. Evaluating 14 biological foundation models with rate-distortion theory and MINE, we identify three failure regimes:…
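The central claim, that a categorical bottleneck discards continuous geometry, can be illustrated with a toy example. The sketch below is not the paper's metric or experimental setup; it assumes a hypothetical distortion proxy (one minus the Pearson correlation between pairwise distances) and a hypothetical uniform binning of points on a unit circle into one-hot tokens. All function names are illustrative.

```python
import numpy as np

def pairwise_distances(x):
    """Euclidean distance between every pair of rows in x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def distance_correlation_gap(reference, embedding):
    """Hypothetical distortion proxy (an assumption, not the paper's metric):
    1 - Pearson correlation between the pairwise distances of two point sets."""
    iu = np.triu_indices(reference.shape[0], k=1)  # each unordered pair once
    d_ref = pairwise_distances(reference)[iu]
    d_emb = pairwise_distances(embedding)[iu]
    return 1.0 - np.corrcoef(d_ref, d_emb)[0, 1]

def one_hot_tokens(theta, n_bins):
    """Quantize angles into n_bins uniform bins and return one-hot codes."""
    bins = np.floor(theta / (2 * np.pi) * n_bins).astype(int) % n_bins
    return np.eye(n_bins)[bins]

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2 * np.pi, size=200)

# Continuous representation: coordinates on the unit circle.
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# The continuous coordinates preserve their own geometry exactly.
continuous_gap = distance_correlation_gap(circle, circle)

# One-hot tokens place every pair of distinct bins at the same distance,
# so neighbouring and opposite points on the circle look equally far apart.
discrete_gap = distance_correlation_gap(circle, one_hot_tokens(theta, 16))

print(f"continuous gap: {continuous_gap:.3f}")
print(f"discrete gap:   {discrete_gap:.3f}")
```

In this toy setting the one-hot codes only record whether two points share a bin, which is why the discrete gap is much larger than the continuous one. A learned embedding over tokens could recover some of that structure, so the sketch shows the bottleneck in isolation rather than the behaviour of any trained model.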

Code & Models

Repositories

prashantcraju/geometric-alignment-tax (GitHub)