Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement
Yakov Pyotr Shkolnikov

TL;DR
This paper shows that vision-language models encode detailed geometric information in their frozen features, which simple linear probes can extract far more accurately than the models' own text output. The gap points to a pathway-training deficit rather than a representational one, and encoders trained with different objectives converge to equivalent accuracy.
Contribution
It shows that geometric knowledge is present in frozen features and can be accessed with lightweight probes, challenging the need for fine-tuning or text-based outputs.
Findings
Linear probes extract hand joint angles with 6.1° MAE from frozen features.
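LoRA fine-tuning narrows the text-output gap from 20.0° to 6.5° MAE, evidence for a pathway-training deficit rather than a representational one.
Encoders trained with different objectives reach statistically equivalent geometric accuracy (R² ≈ 0.55) despite representational similarity as low as CKA = 0.41.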
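To make the probing setup concrete, here is a minimal sketch of a linear probe fit on frozen features. It is not the paper's implementation: the features are synthetic stand-ins, and the dimensions, the ridge penalty `l2`, and the helper names (`fit_linear_probe`, `predict`, `mae_degrees`) are all assumptions chosen for illustration.

```python
import numpy as np

def fit_linear_probe(features, angles, l2=1e-3):
    """Closed-form ridge regression from frozen features to joint angles."""
    # Append a constant column so the probe learns a bias term.
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    reg = l2 * np.eye(X.shape[1])
    reg[-1, -1] = 0.0  # leave the bias unpenalized
    return np.linalg.solve(X.T @ X + reg, X.T @ angles)

def predict(W, features):
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ W

def mae_degrees(pred, target):
    """Mean absolute error, in the same units as the targets (degrees here)."""
    return float(np.mean(np.abs(pred - target)))

# Synthetic stand-in for frozen encoder features (assumption: the real
# features would come from a pretrained encoder whose weights stay fixed).
rng = np.random.default_rng(0)
n_train, n_test, dim, n_joints = 400, 100, 32, 4
true_W = rng.normal(size=(dim, n_joints))
feats = rng.normal(size=(n_train + n_test, dim))
noise = rng.normal(scale=1.0, size=(n_train + n_test, n_joints))
angles = 5.0 * feats @ true_W + 45.0 + noise

W = fit_linear_probe(feats[:n_train], angles[:n_train])
probe_mae = mae_degrees(predict(W, feats[n_train:]), angles[n_train:])

# Baseline: always predict the mean training angle for each joint.
mean_pred = np.tile(angles[:n_train].mean(axis=0), (n_test, 1))
baseline_mae = mae_degrees(mean_pred, angles[n_train:])

print(f"probe MAE: {probe_mae:.2f} deg, mean-baseline MAE: {baseline_mae:.2f} deg")
```

Only `W` is learned, so the parameter count is `(dim + 1) * n_joints`. The encoder itself is never updated, which is what lets the probe measure what the features already contain.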
Abstract
Vision-language models encode continuous geometry that their text pathway fails to express: a 6,000-parameter linear probe extracts hand joint angles at 6.1 degrees MAE from frozen features, while the best text output achieves only 20.0 degrees -- a 3.3x bottleneck. LoRA fine-tuning (r=16, 2,000 images) narrows this gap to 6.5 degrees, providing evidence for a pathway-training deficit rather than a representational one. Training objective determines accuracy more than architecture: five encoders spanning self-supervised, contrastive, and hybrid paradigms converge to statistically equivalent accuracy (R^2 approximately 0.55, TOST-equivalent at delta=0.03) despite sharing as little as CKA=0.41 representational similarity -- functional convergence without representational convergence. Autoregressive generation damages geometric fidelity, but the damage originates in the generation process,…
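The phrase "functional convergence without representational convergence" can be illustrated with a small sketch. It assumes linear CKA, which the abstract does not specify, and uses synthetic data: two hypothetical representations `X` and `Y` share one task-relevant direction but otherwise differ, so a linear readout works equally well on both while their CKA stays low. The helper names and all sizes are assumptions.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two feature matrices with matching rows (samples)."""
    X = X - X.mean(axis=0)  # center each feature column
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return float(cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

def in_sample_r2(features, target):
    """R^2 of an ordinary least-squares readout, evaluated on the fitting data."""
    A = np.hstack([features, np.ones((features.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    return float(1.0 - resid.var() / target.var())

rng = np.random.default_rng(0)
n = 500
s = rng.normal(size=(n, 1))                    # shared task-relevant signal
X = np.hstack([s, rng.normal(size=(n, 20))])   # "encoder A" (hypothetical)
Y = np.hstack([s, rng.normal(size=(n, 20))])   # "encoder B" (hypothetical)
target = 3.0 * s[:, 0] + 0.1 * rng.normal(size=n)

# A random orthogonal matrix, used to check rotation invariance of CKA.
Q, _ = np.linalg.qr(rng.normal(size=(21, 21)))

r2_x = in_sample_r2(X, target)
r2_y = in_sample_r2(Y, target)
print(f"CKA(X, Y) = {linear_cka(X, Y):.3f}")
print(f"R^2 from X = {r2_x:.3f}, R^2 from Y = {r2_y:.3f}")
```

Because CKA compares whole feature spaces, the 20 unrelated dimensions dominate the score even though both representations decode the target equally well.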
Taxonomy
Topics: Interactive and Immersive Displays · Tactile and Sensory Interactions · Handwritten Text Recognition Techniques
