Visual Language Hypothesis

Xiu Li

arXiv:2512.23335 · cs.CV · January 1, 2026

Open Access

TL;DR

This paper proposes a topological framework for visual representation learning. It argues that visual understanding requires a semantic language for vision, and that a model's architecture must be able to capture how visual observations are organized.

Contribution

It introduces a novel topological perspective on visual understanding, linking semantic invariance to the structure of the observation space and model architecture requirements.

Findings

01

Visual observation space has a fiber bundle structure with nuisance and semantic components.

02

Semantic invariance requires a non-homeomorphic, discriminative target, such as labels or multimodal alignment.

03

Model architecture must support topology change through expand and snap processes.

Abstract

We study visual representation learning from a structural and topological perspective. We begin from a single hypothesis: that visual understanding presupposes a semantic language for vision, in which many perceptual observations correspond to a small number of discrete semantic states. Together with widely assumed premises on transferability and abstraction in representation learning, this hypothesis implies that the visual observation space must be organized in a fiber-bundle-like structure, where nuisance variation populates fibers and semantics correspond to a quotient base space. From this structure we derive two theoretical consequences. First, the semantic quotient X/G is not a submanifold of X and cannot be obtained through smooth deformation alone; semantic invariance requires a non-homeomorphic, discriminative target, for example, supervision via labels, cross-instance…
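The fiber and quotient vocabulary in the abstract can be made concrete with a small finite toy. The sketch below is not the paper's construction. It assumes, purely for illustration, that observations are binary signals of length 4 and that the nuisance group G is the group of cyclic shifts. The names orbit and quotient_map are hypothetical. Each orbit plays the role of a fiber, and the set of orbits plays the role of the quotient X/G.

```python
from itertools import product


def orbit(x):
    """Return the fiber through x: every cyclic shift of the tuple x.

    Assumption for this toy: the nuisance group G is the cyclic group
    of shifts acting on fixed-length tuples.
    """
    n = len(x)
    return {x[i:] + x[:i] for i in range(n)}


def quotient_map(x):
    """Send x to a canonical representative of its orbit.

    The lexicographically smallest shift is used as the label of the
    orbit, so two observations get the same label exactly when they
    differ only by a nuisance shift.
    """
    return min(orbit(x))


# Observation space X: all 16 binary signals of length 4.
observations = list(product((0, 1), repeat=4))

# Quotient X/G: the set of distinct orbit labels.
semantic_states = {quotient_map(x) for x in observations}

print(len(observations), "observations ->", len(semantic_states), "states")
for state in sorted(semantic_states):
    print(state, "fiber size:", len(orbit(state)))
```

In this toy, many observations collapse onto a smaller set of discrete labels, and the label is unchanged by any shift within a fiber. Because the space here is finite, the example says nothing about the topological claims in the abstract (for instance, that X/G is not a submanifold of X); it only illustrates what "fibers" and "quotient" refer to.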

Taxonomy

Topics: Topological and Geometric Data Analysis · Child and Animal Learning Development · Face Recognition and Perception