Communication-Inspired Tokenization for Structured Image Representations
Aram Davtyan, Yusuf Sahin, Yasaman Haghighi, Sebastian Stapf, Pablo Acuaviva, Alexandre Alahi, Paolo Favaro

TL;DR
This paper introduces COMiT, a communication-inspired tokenization framework that creates structured, object-centric image representations through iterative, sequential encoding, enhancing interpretability and reasoning in vision models.
Contribution
The paper proposes a novel, communication-inspired tokenization method that constructs structured, object-centric image representations via iterative encoding within a transformer architecture.
Findings
Improves object-centric interpretability of image tokens.
Enhances compositional generalization and relational reasoning.
Outperforms prior methods in structured visual representation tasks.
Abstract
Discrete image tokenizers have emerged as a key component of modern vision and multimodal systems, providing a sequential interface for transformer-based architectures. However, most existing approaches remain primarily optimized for reconstruction and compression, often yielding tokens that capture local texture rather than object-level semantic structure. Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences. COMiT constructs a latent message within a fixed token budget by iteratively observing localized image crops and recurrently updating its discrete representation. At each step, the model integrates new visual information while refining and reorganizing the existing token sequence. After several encoding iterations, the final message…
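The encoding loop described in the abstract (observe a localized crop, then update a fixed-length discrete message) can be illustrated with a toy sketch. This is not the authors' implementation. The random projections `w_crop` and `w_slot` stand in for the learned transformer encoder, the raster-order crop schedule is an assumption, and every function name and parameter here is hypothetical. The only point it shows is that the message length stays fixed while its contents are rewritten at each step.

```python
import numpy as np


def extract_crops(image, crop_size):
    """Split an (H, W) image into non-overlapping square crops in raster order.

    The raster schedule is an assumption made for this sketch.
    """
    h, w = image.shape
    crops = []
    for top in range(0, h - crop_size + 1, crop_size):
        for left in range(0, w - crop_size + 1, crop_size):
            crops.append(image[top:top + crop_size, left:left + crop_size])
    return crops


def quantize(latents, codebook):
    """Return, for each latent vector, the index of its nearest codebook entry."""
    # Squared distances between every latent (K, D) and every code (V, D).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def encode_message(image, codebook, num_tokens=8, crop_size=4, seed=0):
    """Iteratively build a discrete message of exactly `num_tokens` tokens."""
    rng = np.random.default_rng(seed)
    dim = codebook.shape[1]
    # Hypothetical stand-ins for learned weights: fixed random projections.
    w_crop = rng.normal(size=(crop_size * crop_size, dim))
    w_slot = rng.normal(size=(num_tokens, dim))

    tokens = np.zeros(num_tokens, dtype=np.int64)  # initial message
    for crop in extract_crops(image, crop_size):
        crop_feat = crop.reshape(-1) @ w_crop  # crop feature, shape (D,)
        # Recurrent update: each slot combines its current token embedding
        # with the new crop feature, then is re-quantized to a discrete token.
        latents = codebook[tokens] + w_slot * crop_feat  # shape (K, D)
        tokens = quantize(latents, codebook)
    return tokens


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    image = rng.random((8, 8))
    codebook = rng.normal(size=(16, 4))  # 16 codes of dimension 4
    print(encode_message(image, codebook))
```

In this sketch a larger image yields more crops and therefore more update steps, but the returned message always has `num_tokens` entries, which mirrors the fixed token budget mentioned in the abstract.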
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Face Recognition and Perception
