Analyzing The Language of Visual Tokens
David M. Chan, Rodolfo Corona, Joonyong Park, Cheol Jun Cho, Yutong Bai, Trevor Darrell

TL;DR
This paper investigates the statistical properties of visual tokens in transformer-based models, revealing similarities to natural language distributions but also fundamental differences in grammar and hierarchy, informing future model design.
Contribution
It provides a comprehensive analysis of the statistical and structural properties of visual languages in transformer models, highlighting key differences from natural languages.
Findings
Visual languages follow Zipfian distributions similar to natural languages.
Tokens mainly represent object parts at an intermediate granularity.
Visual languages lack cohesive grammatical structures, resulting in higher perplexity.
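As a rough illustration of the Zipfian finding above, the sketch below estimates a Zipf exponent from a sequence of integer token ids by fitting a line to log frequency against log rank. The function name `zipf_exponent`, the least-squares fit, and the synthetic data are assumptions made for illustration; the paper's own fitting procedure may differ.

```python
import math
from collections import Counter


def zipf_exponent(tokens):
    """Estimate the Zipf exponent s, assuming frequency ~ rank ** (-s).

    Uses an ordinary least-squares fit of log(frequency) on log(rank).
    This is an illustrative estimator, not the paper's method.
    """
    # Frequencies sorted from most to least common give the rank ordering.
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    # The fitted slope is negative for a decaying distribution; s = -slope.
    return -(cov / var)


# Synthetic token ids (hypothetical data): id k appears about 1000 / k times,
# which is Zipfian with an exponent near 1 by construction.
tokens = [k for k in range(1, 51) for _ in range(round(1000 / k))]
print(f"estimated exponent: {zipf_exponent(tokens):.3f}")
```

An exponent near 1 is the classic natural-language case; applying the same function to the ids produced by a visual tokenizer would give a comparable number for a visual language.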
Abstract
With the introduction of transformer-based models for vision and language tasks, such as LLaVA and Chameleon, there has been renewed interest in the discrete tokenized representation of images. These models often treat image patches as discrete tokens, analogous to words in natural language, learning joint alignments between visual and human languages. However, little is known about the statistical behavior of these visual languages - whether they follow similar frequency distributions, grammatical structures, or topologies as natural languages. In this paper, we take a natural-language-centric approach to analyzing discrete visual languages and uncover striking similarities and fundamental differences. We demonstrate that, although visual languages adhere to Zipfian distributions, higher token innovation drives greater entropy and lower compression, with tokens predominantly…
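The entropy and compression claims in the abstract can be made concrete with a minimal sketch. It assumes tokens are integer ids below 65536, measures entropy over unigram frequencies, and uses zlib as a stand-in compressor. The helper names `unigram_entropy` and `compression_ratio` are hypothetical, and the paper may define both quantities differently.

```python
import math
import zlib
from collections import Counter


def unigram_entropy(tokens):
    """Shannon entropy, in bits per token, of the unigram distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def compression_ratio(tokens):
    """Raw size divided by zlib-compressed size; higher means more compressible.

    Assumes every token id fits in two bytes (0 <= id < 65536).
    """
    raw = b"".join(t.to_bytes(2, "big") for t in tokens)
    return len(raw) / len(zlib.compress(raw, 9))


# Hypothetical sequences: one highly repetitive, one with all-distinct ids.
repetitive = [7] * 1000
varied = [(i * 7919) % 65521 for i in range(1000)]

print(unigram_entropy(repetitive), compression_ratio(repetitive))
print(unigram_entropy(varied), compression_ratio(varied))
```

A sequence that keeps introducing new tokens, like `varied`, shows higher entropy and a lower compression ratio than `repetitive`, which is the direction of the effect the abstract attributes to higher token innovation.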
Peer Reviews
Decision·Submitted to ICLR 2025
1. This paper is well-organized and offers a fresh perspective by treating visual tokens as discrete elements analogous to words in natural language. 2. The experiments are well-executed, with thorough empirical analysis across several datasets and tokenization methods. 3. The work has significant implications for multimodal model design, suggesting that unique features of visual tokens may require new model designs for better performance in vision-language tasks.
1. While the paper evaluates various tokenization methods (e.g., VQ-VAE, Chameleon), it could benefit from exploring alternative tokenization strategies, especially non-discrete or hybrid methods. 2. The study primarily relies on commonly used datasets (e.g., MS-COCO, ImageNet) that may not fully capture the diversity and complexity of visual scenes in real-world multimodal applications. Including more varied datasets with richer visual and contextual details would help address this.
1. The idea of analyzing the visual tokens learned by VQ-VAE models and investigating their properties using natural language tools is interesting and could be inspiring. 2. The writing is clear, with each research question and findings clearly stated in each of the sections.
1. The paper presents various findings and suggests potential implications and directions for future work. However, it lacks follow-up experiments to support these hypotheses, making the claims less convincing and raising doubts about the practical value of the findings. 2. The approach of applying language tools directly to visual tokens is questionable, especially as the findings suggest that visual tokens may not inherently exhibit natural language structures. Additionally, it remains uncerta
1. By examining visual tokens through the lens of linguistic principles, the authors provide a novel framework for analyzing multimodal models, which enriches the discourse surrounding the integration of vision and language. 2. The comparison of visual languages to natural languages, especially in terms of grammatical structure and co-occurrence patterns, yields interesting conclusions about the nature of visual representation and its implications for model design. 3. The paper writing is clear.
1. The paper primarily borrows conclusions and formulas from the realm of language models and applies them to visual tokens without offering substantial new insights specific to visual representations. The analysis and discussion often appear superficial, failing to yield novel conclusions regarding the unique characteristics of visual tokens. 2. The study lacks original analytical frameworks or targeted statistical experiments designed specifically for visual tokens. As a result, the manuscript
Taxonomy
Topics: Subtitles and Audiovisual Media · Handwritten Text Recognition Techniques
Methods: ALIGN
