Communication-Inspired Tokenization for Structured Image Representations

Aram Davtyan; Yusuf Sahin; Yasaman Haghighi; Sebastian Stapf; Pablo Acuaviva; Alexandre Alahi; Paolo Favaro

arXiv:2602.20731·cs.CV·February 25, 2026

TL;DR

This paper introduces COMiT, a communication-inspired tokenization framework that creates structured, object-centric image representations through iterative, sequential encoding, enhancing interpretability and reasoning in vision models.

Contribution

The paper proposes a novel, communication-inspired tokenization method that constructs structured, object-centric image representations via iterative encoding within a transformer architecture.

Findings

01. Improves object-centric interpretability of image tokens.

02. Enhances compositional generalization and relational reasoning.

03. Outperforms prior methods in structured visual representation tasks.

Abstract

Discrete image tokenizers have emerged as a key component of modern vision and multimodal systems, providing a sequential interface for transformer-based architectures. However, most existing approaches remain primarily optimized for reconstruction and compression, often yielding tokens that capture local texture rather than object-level semantic structure. Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences. COMiT constructs a latent message within a fixed token budget by iteratively observing localized image crops and recurrently updating its discrete representation. At each step, the model integrates new visual information while refining and reorganizing the existing token sequence. After several encoding iterations, the final message…
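
The encoding loop described in the abstract (observe a localized crop, then update a fixed-length discrete message) can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`extract_crops`, `quantize`, `encode`), the random codebook, the flattened-pixel crop feature, and the per-slot mixing weight are all assumptions made for illustration. COMiT itself uses a transformer, which this sketch replaces with a simple blend-and-quantize step.

```python
import random


def extract_crops(image, crop_size):
    """Split a 2D list of pixels into non-overlapping square crops (row-major)."""
    height, width = len(image), len(image[0])
    crops = []
    for top in range(0, height - crop_size + 1, crop_size):
        for left in range(0, width - crop_size + 1, crop_size):
            crops.append([row[left:left + crop_size]
                          for row in image[top:top + crop_size]])
    return crops


def quantize(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(entry):
        return sum((a - b) ** 2 for a, b in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def encode(image, codebook, num_tokens, crop_size, mix=0.5):
    """Build a discrete message of exactly `num_tokens` ids, one crop at a time.

    Hypothetical update rule: each slot blends the embedding of its current
    token with the new crop feature, then is re-quantized. The message length
    never changes, so the token budget stays fixed across iterations.
    """
    tokens = [0] * num_tokens  # initial message: every slot holds token 0
    for crop in extract_crops(image, crop_size):
        feature = [pixel for row in crop for pixel in row]  # toy crop feature
        updated_tokens = []
        for slot, token in enumerate(tokens):
            weight = mix / (slot + 1)  # assumed: later slots change more slowly
            embedding = codebook[token]
            blended = [(1 - weight) * e + weight * f
                       for e, f in zip(embedding, feature)]
            updated_tokens.append(quantize(blended, codebook))
        tokens = updated_tokens  # the whole message is revised at every step
    return tokens


# Toy usage with a random 8x8 "image" and a random 32-entry codebook.
rng = random.Random(0)
image = [[rng.random() for _ in range(8)] for _ in range(8)]
codebook = [[rng.random() for _ in range(16)] for _ in range(32)]
message = encode(image, codebook, num_tokens=6, crop_size=4)
print(message)
```

The point of the sketch is the control flow: the message keeps a constant length while every slot may be rewritten after each new crop, mirroring the "integrate, refine, and reorganize" behaviour the abstract describes.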

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Face Recognition and Perception