Learning Sparse Visual Representations via Spatial-Semantic Factorization

Theodore Zhengde Zhao; Sid Kiblawi; Jianwei Yang; Naoto Usuyama; Reuben Tan; Noel C Codella; Tristan Naumann; Hoifung Poon; Mu Wei

arXiv:2602.01905·cs.CV·February 3, 2026

Open Access

TL;DR

STELLAR factorizes visual features into semantic concepts and their spatial distributions, so that a small set of sparse tokens supports both high-quality reconstruction and strong semantic understanding in self-supervised learning.

Contribution

STELLAR proposes a factorization that disentangles semantic content from spatial information, bridging the gap between discriminative SSL methods (e.g., DINO) and generative SSL methods (e.g., MAE).

Findings

01 As few as 16 sparse tokens support high-quality reconstruction (FID 2.60).

02 Achieves 79.10% ImageNet accuracy with sparse tokens, matching dense backbone performance.

03 Demonstrates effective semantic and spatial feature disentanglement in SSL.

Abstract

Self-supervised learning (SSL) faces a fundamental conflict between semantic understanding and image reconstruction. High-level semantic SSL (e.g., DINO) relies on global tokens that are forced to be location-invariant for augmentation alignment, a process that inherently discards the spatial coordinates required for reconstruction. Conversely, generative SSL (e.g., MAE) preserves dense feature grids for reconstruction but fails to produce high-level abstractions. We introduce STELLAR, a framework that resolves this tension by factorizing visual features into a low-rank product of semantic concepts and their spatial distributions. This disentanglement allows us to perform DINO-style augmentation alignment on the semantic tokens while maintaining the precise spatial mapping in the localization matrix necessary for pixel-level reconstruction. We demonstrate that as few as 16 sparse tokens…
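
The factorization described in the abstract can be illustrated with a toy example. The sketch below is not the STELLAR implementation: the shapes, the random values, the softmax normalization of the localization matrix, and the helper names `softmax`, `matmul`, and `compose` are all assumptions made for illustration. It builds dense features as the product of an N x K localization matrix and K x D semantic tokens, then moves content across grid positions to show that only the localization matrix changes while the semantic tokens stay fixed.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x d)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def compose(localization, tokens):
    """Dense features as the low-rank product localization @ tokens."""
    return matmul(localization, tokens)


rng = random.Random(0)

# Hypothetical sizes: an 8x8 feature grid, 16 sparse tokens, 32 channels.
N, K, D = 64, 16, 32

# Semantic tokens: K concept vectors with no spatial index.
tokens = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]

# Localization matrix: one row per grid position, giving that position's
# weights over the K concepts (softmax rows are an assumption here).
localization = [softmax([rng.gauss(0, 1) for _ in range(K)]) for _ in range(N)]

# Dense N x D feature grid, of rank at most K by construction.
dense = compose(localization, tokens)

# Simulate a spatial rearrangement by shuffling grid positions.
# Only the localization rows move; the semantic tokens are reused unchanged.
perm = list(range(N))
rng.shuffle(perm)
moved_localization = [localization[i] for i in perm]
moved_dense = compose(moved_localization, tokens)

print(len(dense), len(dense[0]))
```

In this toy setting, an alignment objective could be applied to `tokens` alone, since they carry no position, while `localization` keeps the per-position information that a reconstruction objective would need.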

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications