Probing the Latent World: Emergent Discrete Symbols and Physical Structure in Latent Representations

Liu hung ming

arXiv:2603.20327 · cs.LG · March 24, 2026

Open Access

TL;DR

This paper introduces AIM, a passive quantization probe that reveals structured symbolic representations in the latent space of V-JEPA 2, a video world model trained with a Joint Embedding Predictive Architecture, and highlights the potential of these representations for interpretable physical understanding.

Contribution

The paper presents AIM, a method for extracting discrete symbols from frozen latent representations, enabling interpretability without modifying the encoder or using task-specific supervision.

Findings

01. V-JEPA 2 latent space encodes physical structures as distributional variations.

02. Symbol distributions differ significantly across physical categories.

03. Latent space is compact with shared core representations across actions.
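
One way to check whether symbol distributions differ between two physical categories is sketched below. This page does not say which statistic or test the paper uses, so the Jensen-Shannon divergence, the permutation test, the vocabulary size `k = 8`, and the synthetic sequences `cat_a` and `cat_b` are all assumptions made for illustration.

```python
import numpy as np

def symbol_hist(symbols, k):
    """Normalised histogram of symbol indices over a vocabulary of size k."""
    counts = np.bincount(symbols, minlength=k).astype(float)
    return counts / counts.sum()

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def permutation_pvalue(a, b, k, n_perm=1000, seed=0):
    """Permutation test: how often does shuffling category labels give a
    divergence at least as large as the observed one?"""
    rng = np.random.default_rng(seed)
    observed = js_divergence(symbol_hist(a, k), symbol_hist(b, k))
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        d = js_divergence(symbol_hist(pooled[:len(a)], k),
                          symbol_hist(pooled[len(a):], k))
        hits += int(d >= observed)
    return observed, (hits + 1) / (n_perm + 1)

# Synthetic symbol sequences standing in for two categories (assumption).
rng = np.random.default_rng(1)
k = 8
cat_a = rng.choice(k, size=400, p=[0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.05, 0.05])
cat_b = rng.choice(k, size=400, p=[0.05, 0.05, 0.1, 0.1, 0.1, 0.1, 0.2, 0.3])
jsd, p_value = permutation_pvalue(cat_a, cat_b, k)
print(f"JS divergence: {jsd:.3f} bits, permutation p-value: {p_value:.4f}")
```

A permutation test is used here because it makes no assumption about how the symbol counts are distributed; a chi-squared test on the contingency table would be a reasonable alternative.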

Abstract

Video world models trained with Joint Embedding Predictive Architectures (JEPA) acquire rich spatiotemporal representations by predicting masked regions in latent space rather than reconstructing pixels. This removes the visual verification pathway of generative models, creating a structural interpretability gap: the encoder has learned physical structure inaccessible in any inspectable form. Existing probing methods either operate in continuous space without a structured intermediate layer, or attach generative components whose parameters confound attribution of behavior to the encoder. We propose the AI Mother Tongue (AIM) framework as a passive quantization probe: a lightweight, vocabulary-free probe that converts V-JEPA 2 continuous latent vectors into discrete symbol sequences without task-specific supervision or modifying the encoder. Because the encoder is kept completely…
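
The idea of a passive quantization probe can be illustrated with a minimal sketch: fit a codebook on latent vectors from a frozen encoder, then map each vector to the index of its nearest codebook entry. The truncated abstract does not specify which quantizer AIM uses, so plain k-means stands in for it here; the function names, the codebook size `k = 8`, and the random array that takes the place of V-JEPA 2 outputs are all hypothetical.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def fit_codebook(latents, k, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm) over latent vectors.

    Only the codebook is fitted; the encoder that produced `latents`
    is never touched, which is what keeps the probe passive.
    """
    rng = np.random.default_rng(seed)
    # Initialise the codebook from k distinct latent vectors.
    init = rng.choice(len(latents), size=k, replace=False)
    codebook = latents[init].copy()
    for _ in range(iters):
        symbols = quantize(latents, codebook)
        for j in range(k):
            members = latents[symbols == j]
            if len(members):  # an empty cluster keeps its previous centre
                codebook[j] = members.mean(axis=0)
    return codebook

# Stand-in for frozen encoder outputs, shape (num_tokens, dim).
# Random data is an assumption; real latents would come from the encoder.
rng = np.random.default_rng(0)
latents = rng.normal(size=(500, 16)).astype(np.float32)

codebook = fit_codebook(latents, k=8)
symbols = quantize(latents, codebook)  # discrete symbol sequence
print(symbols[:20])
```

Because no gradient flows back into the encoder, any structure found in `symbols` can be attributed to the encoder's representation and the choice of quantizer, not to a trained decoder.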

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning · Human Pose and Action Recognition