TL;DR
This paper introduces DiVT, a novel visual tokenization method that clusters image patches into semantic units, improving multimodal model efficiency and compatibility with language models.
Contribution
DiVT clusters visual embeddings into semantic tokens and adapts token count based on image complexity, enhancing efficiency without retraining core models.
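To make the idea concrete, here is a minimal sketch of clustering patch embeddings into a variable number of tokens. It is not the paper's algorithm: the summary does not say how DiVT clusters or how it sets its token budget, so a simple greedy threshold clustering on cosine similarity stands in for both. The names `cluster_patches`, `sim_threshold`, and `max_tokens` are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cluster_patches(patches, sim_threshold=0.9, max_tokens=16):
    """Greedy threshold clustering (an assumed stand-in, not DiVT itself).

    Each patch joins the most similar existing cluster if the similarity
    reaches sim_threshold, or if the max_tokens cap is already hit;
    otherwise it starts a new cluster. Returns one mean vector per cluster.
    """
    clusters = []  # each cluster is a list of member patch embeddings
    for p in patches:
        best_idx, best_sim = -1, -math.inf
        for i, members in enumerate(clusters):
            s = cosine(p, mean(members))
            if s > best_sim:
                best_idx, best_sim = i, s
        at_cap = len(clusters) >= max_tokens
        if best_idx >= 0 and (best_sim >= sim_threshold or at_cap):
            clusters[best_idx].append(p)
        else:
            clusters.append([p])
    return [mean(members) for members in clusters]


# A uniform "image" collapses to one token; a varied one keeps more.
uniform = [[1.0, 0.0]] * 4
varied = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
print(len(cluster_patches(uniform)))               # 1
print(len(cluster_patches(varied)))                # 3
print(len(cluster_patches(varied, max_tokens=2)))  # 2
```

In this sketch the token count falls out of the data: redundant patches merge, distinct ones survive, and `max_tokens` acts as an explicit budget cap. A real system would operate on high-dimensional encoder outputs and would likely use a batched, learned, or iterative clustering step instead of this single greedy pass.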
Findings
DiVT matches or surpasses baseline performance with fewer tokens.
It reduces memory cost and latency significantly.
It remains robust under limited token budgets.
Abstract
Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a visual projector that maps pixels into a sequence of tokens in the language model's embedding space, so that images can be presented in essentially the same form as text. However, the language model has been optimized to operate on discrete, semantically meaningful tokens, while prevailing visual projectors transform an image into a long stream of continuous and highly correlated embeddings. This causes the visual tokens to behave differently from the word-like units that LLMs were originally trained to understand. We propose Disentangled Visual Tokenization (DiVT), a method that clusters patch embeddings into coherent semantic units, so each token corresponds to a distinct visual concept instead of a rigid grid cell. DiVT further adapts its token budget to image complexity, providing an explicit…
