InfoTok: Information-Theoretic Regularization for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs
Lv Tang, Tianyi Zheng, Bo Li, Xingyu Li

TL;DR
This paper introduces InfoTok, an information-theoretic regularization method for visual tokenization in unified multimodal large language models, balancing compression and task relevance to improve performance.
Contribution
It proposes a novel information bottleneck-based tokenization mechanism that explicitly controls information flow, enhancing both understanding and generation in MLLMs.
Findings
InfoTok improves performance in image understanding and generation tasks.
It enforces a principled trade-off between compression and task relevance.
The method integrates into existing models without additional training data.
Abstract
Unified multimodal large language models (MLLMs) aim to unify image understanding and image generation within a single framework, where a shared visual tokenizer serves as the sole interface that maps high-dimensional images into a limited token budget for downstream multimodal reasoning and synthesis. However, existing shared-token designs are largely architecture-driven and lack an explicit criterion for what information should be preserved to simultaneously support semantic abstraction and visual detail. In this paper, we adopt a capacity-constrained perspective, viewing the shared tokenizer as a compute-bounded learner whose finite representational budget should prioritize reusable structure over hard-to-exploit high-entropy variations and redundancy. Motivated by this view, we propose \textbf{\textit{InfoTok}}, an information-regularized tokenization mechanism grounded in the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
