Probing the Latent World: Emergent Discrete Symbols and Physical Structure in Latent Representations
Liu hung ming

TL;DR
This paper introduces AI Mother Tongue (AIM), a passive quantization probe that reveals structured symbolic representations in the latent space of the V-JEPA 2 video world model, highlighting their potential for interpretable physical understanding.
Contribution
The paper presents AIM, a novel method for extracting discrete symbols from frozen latent representations, enabling interpretability without retraining or supervision.
Findings
V-JEPA 2 latent space encodes physical structures as distributional variations.
Symbol distributions differ significantly across physical categories.
Latent space is compact with shared core representations across actions.
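One way to make "symbol distributions differ across physical categories" concrete is to compare the normalized symbol histograms of two categories. The sketch below uses the Jensen-Shannon divergence for that comparison; the choice of metric, the function names, and the toy symbol arrays are all assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def symbol_histogram(symbols, vocab_size):
    """Normalized frequency of each discrete symbol (hypothetical helper)."""
    counts = np.bincount(symbols, minlength=vocab_size).astype(float)
    return counts / counts.sum()

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two distributions."""
    m = 0.5 * (p + q)

    def kl(a, b):
        # Only sum where a > 0; b = m is strictly positive there.
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy symbol sequences standing in for two physical categories (made up).
category_a = np.array([0, 0, 1, 1])
category_b = np.array([2, 2, 3, 3])

p = symbol_histogram(category_a, vocab_size=4)
q = symbol_histogram(category_b, vocab_size=4)
divergence = js_divergence(p, q)
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (no shared symbols), which makes it easy to read as a degree of separation between categories.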
Abstract
Video world models trained with Joint Embedding Predictive Architectures (JEPA) acquire rich spatiotemporal representations by predicting masked regions in latent space rather than reconstructing pixels. This removes the visual verification pathway of generative models, creating a structural interpretability gap: the encoder has learned physical structure inaccessible in any inspectable form. Existing probing methods either operate in continuous space without a structured intermediate layer, or attach generative components whose parameters confound attribution of behavior to the encoder. We propose the AI Mother Tongue (AIM) framework as a passive quantization probe: a lightweight, vocabulary-free probe that converts V-JEPA 2 continuous latent vectors into discrete symbol sequences without task-specific supervision or modifying the encoder. Because the encoder is kept completely…
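The idea of a passive quantization probe can be sketched as nearest-centroid assignment over latent vectors taken from a frozen encoder. This excerpt does not describe AIM's actual quantizer, so the k-means codebook below is a swapped-in stand-in, and the function names, codebook size, and random "latents" are hypothetical. The property it illustrates is that the probe only reads encoder outputs and never updates the encoder.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    # Squared Euclidean distances, shape (num_latents, num_codes).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def fit_codebook(latents, k, iters=20, seed=0):
    """Plain k-means over frozen latents (assumed stand-in, not AIM itself)."""
    rng = np.random.default_rng(seed)
    init = rng.choice(len(latents), size=k, replace=False)
    codebook = latents[init].copy()
    for _ in range(iters):
        symbols = quantize(latents, codebook)
        for j in range(k):
            members = latents[symbols == j]
            if len(members):  # leave empty clusters where they are
                codebook[j] = members.mean(axis=0)
    return codebook

# Random vectors standing in for frozen encoder outputs (not real V-JEPA 2 data).
rng = np.random.default_rng(0)
latents = rng.normal(size=(256, 16))

codebook = fit_codebook(latents, k=8)
symbols = quantize(latents, codebook)  # one discrete symbol per latent vector
```

Because the codebook is fit only on encoder outputs, any structure found in `symbols` can be attributed to the encoder rather than to trained probe parameters, which is the attribution argument the abstract makes.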
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning · Human Pose and Action Recognition
