A More Word-like Image Tokenization for MLLMs

Hyun Lee; Hyemin Jeong; Yejin Kim; Hyungwook Choi; Hyunsoo Cho; Soo Kyung Kim; Joonseok Lee

arXiv:2605.17954·cs.CV·May 19, 2026

TL;DR

This paper introduces DiVT, a novel visual tokenization method that clusters image patches into semantic units, improving multimodal model efficiency and compatibility with language models.

Contribution

DiVT clusters visual embeddings into semantic tokens and adapts token count based on image complexity, enhancing efficiency without retraining core models.
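The idea of grouping patch embeddings into a variable number of tokens can be illustrated with a minimal sketch. This is not the DiVT algorithm, whose details are not given in this summary; it swaps in a simple greedy "leader" clustering on cosine similarity, and the function name `cluster_patch_tokens` and the parameter `sim_threshold` are hypothetical. The point it shows is only that a similarity threshold, rather than a fixed grid, lets the token count follow image complexity.

```python
import numpy as np


def cluster_patch_tokens(patches, sim_threshold=0.9):
    """Group patch embeddings into cluster tokens (illustrative only).

    Assumption: greedy leader clustering stands in for whatever clustering
    DiVT actually uses. `patches` is a non-empty (num_patches, dim) array.
    Returns (tokens, labels): one mean embedding per cluster, and the
    cluster index assigned to each patch.
    """
    patches = np.asarray(patches, dtype=np.float64)
    norms = np.linalg.norm(patches, axis=1, keepdims=True)
    unit = patches / np.maximum(norms, 1e-12)  # unit vectors for cosine similarity

    leaders = []  # unit embedding of the first patch that opened each cluster
    labels = np.empty(len(patches), dtype=np.int64)
    for i, vec in enumerate(unit):
        if leaders:
            sims = np.stack(leaders) @ vec  # cosine similarity to every leader
            best = int(np.argmax(sims))
            if sims[best] >= sim_threshold:
                labels[i] = best  # close enough: join the existing cluster
                continue
        leaders.append(vec)  # otherwise open a new cluster
        labels[i] = len(leaders) - 1

    # One token per cluster: the mean of its member embeddings.
    tokens = np.stack(
        [patches[labels == k].mean(axis=0) for k in range(len(leaders))]
    )
    return tokens, labels
```

Under this sketch, an image whose patches all share one embedding collapses to a single token, while an image with several dissimilar regions yields one token per region, so the sequence passed to the language model shrinks for simple images and grows for complex ones.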

Findings

01 DiVT matches or surpasses baseline performance with fewer tokens.

02 It significantly reduces memory cost and latency.

03 It remains robust under limited token budgets.

Abstract

Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a visual projector that maps the pixels into a sequence of tokens in its embedding space, so that images can be presented in essentially the same form as text. However, the language model has been optimized to operate on discrete, semantically meaningful tokens, while prevailing visual projectors transform an image into a long stream of continuous and highly correlated embeddings. This causes the visual tokens to behave differently from the word-like units that LLMs are originally trained to understand. We propose a novel Disentangled Visual Tokenization (DiVT) that clusters patch embeddings into coherent semantic units, so each token corresponds to a distinct visual concept instead of a rigid grid cell. DiVT further adapts its token budget to image complexity, providing an explicit…

Code & Models

Repositories

snuviplab/DiVT (GitHub)
