Revisiting MLLM Token Technology through the Lens of Classical Visual Coding

Jinming Liu; Junyan Lin; Yuntao Wei; Kele Shao; Keda Tao; Jianguo Huang; Xudong Yang; Zhibo Chen; Huan Wang; Xin Jin

arXiv:2508.13460 · cs.CV · August 20, 2025

TL;DR

This paper reexamines MLLM token technology using classical visual coding principles, establishing a unified framework for comparison and exploring how insights from visual coding can improve multimodal models and visual codecs.

Contribution

It provides the first comprehensive, structured comparison between MLLM token technology and visual coding, integrating principles from both fields for enhanced efficiency and robustness.

Findings

01 Unified formulation bridging token technology and visual coding

02 Insights for improving MLLM token efficiency and robustness

03 Guidance for designing next-generation semantic visual codecs

Abstract

Classical visual coding and Multimodal Large Language Model (MLLM) token technology share a core objective: maximizing information fidelity while minimizing computational cost. This paper therefore reexamines MLLM token technology, including tokenization, token compression, and token reasoning, through the established principles of the long-developed field of visual coding. From this perspective, we (1) establish a unified formulation bridging token technology and visual coding, enabling a systematic, module-by-module comparative analysis; (2) synthesize bidirectional insights, exploring how visual coding principles can enhance the efficiency and robustness of MLLM token techniques and, conversely, how token technology paradigms can inform the design of next-generation semantic visual codecs; and (3) outline promising future research directions and critical unsolved challenges. In summary, this…
