Towards Semantic Equivalence of Tokenization in Multimodal LLM

Shengqiong Wu; Hao Fei; Xiangtai Li; Jiayi Ji; Hanwang Zhang; Tat-Seng Chua; Shuicheng Yan

arXiv:2406.05127·cs.CV·February 27, 2025·2 cites

Open Access · 3 Reviews

TL;DR

This paper introduces SeTok, a dynamic vision tokenizer for multimodal large language models that preserves semantic integrity of visual features, improving performance across vision-language tasks.

Contribution

The paper proposes a novel dynamic clustering-based vision tokenizer, SeTok, which better preserves semantic information compared to existing methods.

Findings

01

SeTok effectively maintains semantic integrity of visual features.

02

Setokim, an MLLM equipped with SeTok, outperforms existing models on multiple tasks.

03

The dynamic clustering adapts to image complexity for optimal tokenization.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in processing vision-language tasks. A crux of MLLMs lies in vision tokenization, which involves efficiently transforming input visual signals into feature representations that are most beneficial for LLMs. However, existing vision tokenizers, essential for semantic alignment between vision and language, remain problematic. Existing methods aggressively fragment visual input, corrupting the visual semantic integrity. To address this, this paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok), which groups visual features into semantic units via a dynamic clustering algorithm, flexibly determining the number of tokens based on image complexity. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual…
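The abstract's core idea, grouping patch features into a variable number of semantic tokens, can be illustrated with a small sketch. This is not the authors' clustering algorithm, which the text here does not spell out; it swaps in a simple greedy cosine-similarity clustering as a stand-in. The function names, the `sim_threshold` parameter, and the synthetic "images" are all assumptions made for illustration.

```python
import numpy as np

def cluster_patches(features, sim_threshold=0.8):
    """Greedy cosine clustering (illustrative stand-in, not SeTok itself).

    features: (num_patches, dim) array of patch embeddings.
    Returns (tokens, labels): one mean-pooled token per cluster, and the
    cluster index assigned to each patch.
    """
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    centroid_sums = []  # running sum of normalized members per cluster
    labels = np.empty(len(features), dtype=int)
    for i, f in enumerate(normed):
        if centroid_sums:
            dirs = np.stack([c / np.linalg.norm(c) for c in centroid_sums])
            sims = dirs @ f
            best = int(np.argmax(sims))
            if sims[best] >= sim_threshold:
                # Close enough to an existing cluster: join it.
                centroid_sums[best] += f
                labels[i] = best
                continue
        # No sufficiently similar cluster: open a new one.
        centroid_sums.append(f.copy())
        labels[i] = len(centroid_sums) - 1
    tokens = np.stack(
        [features[labels == k].mean(axis=0) for k in range(len(centroid_sums))]
    )
    return tokens, labels

def make_patches(num_regions, patches_per_region=8, dim=16, seed=0):
    """Hypothetical synthetic 'image': noisy copies of orthogonal prototypes."""
    rng = np.random.default_rng(seed)
    prototypes = np.eye(dim)[:num_regions]
    feats = np.repeat(prototypes, patches_per_region, axis=0)
    return feats + 0.01 * rng.standard_normal(feats.shape)

simple_tokens, _ = cluster_patches(make_patches(num_regions=2))
complex_tokens, _ = cluster_patches(make_patches(num_regions=5))
print(simple_tokens.shape[0], complex_tokens.shape[0])  # token counts differ
```

The point of the sketch is only that the token count is an output of the clustering rather than a fixed hyperparameter: an input with more distinct regions yields more tokens, whereas a fixed-grid tokenizer would emit the same number for both.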

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. This paper points out that the main bottleneck of current MLLMs is the visual tokenization method, which cannot effectively align semantic and image features. Current visual tokenization methods simply divide the image into a fixed number of patches, which not only destroys visual context information but also makes it difficult to align visual semantic units. 2. During SeTok training, this work introduces a concept-level text-image contrastive loss and an image reconstruction loss. The former en

Weaknesses

1. Table 2 compares SETOKIM with similar methods on zero-shot text-to-image generation and editing across several datasets. The results show that its metrics are not the best on some datasets, and a gap to SOTA remains.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- The motivation of dynamic SeTok makes a lot of sense. As the paper mentions, fragmenting visual input can corrupt visual semantic integrity, which is a long-standing problem in visual tokenization. This paper provides a simple solution through clustering that nonetheless achieves high performance, which could inspire future work. - Sufficient experimental evaluation. Various tasks are included, such as referring expression segmentation, visual understanding, and text-to-image generation. This could take much

Weaknesses

- Unfair comparison. 1) The model is finetuned on many multimodal instruction datasets, whereas the compared works, such as those in Tab. 1, use smaller amounts of finetuning data. A comparison across different training data hardly demonstrates superiority. 2) The visual encoder introduces more parameters for clustering (20 transformer layers in total). 3) A more advanced pretrained encoder, SigLIP-384, is used. In conclusion, many settings, including the data, the model parameters, and t

Reviewer 03 · Rating 5 · Confidence 5

Strengths

1. The proposed SeTok is interesting and valuable. 2. The experimental details in the paper are thorough, ensuring reproducibility.

Weaknesses

1. The performance of MLLMs is greatly influenced by the SFT data and the LLM. Therefore, when comparing image understanding capabilities, it is important to compare under the same settings, such as those used by LLaVA, to better reflect the effectiveness of the proposed method. However, this comparison is missing from the experiments. 2. In the comparison of image understanding capabilities, there is a lack of evaluation on more fine-grained benchmarks such as TextVQA, OCRBench, SE

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Multi-Agent Systems and Negotiation