From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
Wanpeng Zhang, Zilong Xie, Yicheng Feng, Yijiang Li, Xingrun Xing, Sipeng Zheng, Zongqing Lu

TL;DR
This paper presents a novel image tokenizer using Byte-Pair Encoding for visual data, improving multimodal understanding in large language models by integrating structural information directly into image tokens.
Contribution
The paper introduces a BPE-based image tokenizer that incorporates structural priors, enabling more effective multimodal learning without relying on separate visual encoders.
Findings
Enhanced multimodal understanding demonstrated by experiments
Superior performance of Being-VL-0 across benchmarks
Effective learning with limited training data
Abstract
Multimodal Large Language Models have made significant strides in integrating visual and textual information, yet they often struggle with effectively aligning these modalities. We introduce a novel image tokenizer that bridges this gap by applying the principle of Byte-Pair Encoding (BPE) to visual data. Unlike conventional approaches that rely on separate visual encoders, our method directly incorporates structural prior information into image tokens, mirroring the successful tokenization strategies used in text-only Large Language Models. This innovative approach enables Transformer models to more effectively learn and reason across modalities. Through theoretical analysis and extensive experiments, we demonstrate that our BPE Image Tokenizer significantly enhances MLLMs' multimodal understanding capabilities, even with limited training data. Leveraging this method, we develop…
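To make the core idea concrete, here is a minimal sketch of BPE-style merging applied to quantized image tokens. It is not the authors' implementation. It assumes each image has already been quantized (for example by a VQ model) into integer codebook IDs and flattened into a 1D sequence, whereas the paper describes incorporating 2D structural information. The names `most_frequent_pair`, `merge_pair`, and `train_bpe` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return ((a, b), count) for the most common adjacent ID pair, or None."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None
    return counts.most_common(1)[0]


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(sequences, base_vocab_size, num_merges):
    """Learn up to `num_merges` merge rules over flattened token sequences.

    Assumption: IDs 0..base_vocab_size-1 are the quantizer's codebook entries;
    merged tokens receive fresh IDs starting at base_vocab_size.
    """
    seqs = [list(s) for s in sequences]
    merges = []
    next_id = base_vocab_size
    for _ in range(num_merges):
        best = most_frequent_pair(seqs)
        if best is None or best[1] < 2:  # stop when no pair repeats
            break
        pair = best[0]
        merges.append((pair, next_id))
        seqs = [merge_pair(s, pair, next_id) for s in seqs]
        next_id += 1
    return merges, seqs


# Toy "images": flattened codebook IDs from a hypothetical 4-entry codebook.
toy = [[0, 1, 0, 1, 2], [0, 1, 2, 3]]
merges, encoded = train_bpe(toy, base_vocab_size=4, num_merges=2)
print(merges)   # [((0, 1), 4), ((4, 2), 5)]
print(encoded)  # [[4, 5], [5, 3]]
```

In this toy run, the frequent pair `(0, 1)` becomes token `4`, and then `(4, 2)` becomes token `5`, so recurring local patterns are represented by single tokens before they reach the Transformer. A 2D variant would count vertically adjacent pairs as well as horizontal ones.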
Peer Reviews
Decision: ICLR 2025 Poster
1. The reported results show some improvement. 2. The experimental settings are clear.
1. Performance is poor compared to any MLLM that uses a CLIP-style or even DINO-style visual encoder. 2. There is no projector in the experiments. This could be an extremely unfair setting compared to the classical pipeline. 3. I do not think the proofs help in understanding what is going on in the experiments.
1. The BPE image tokenization approach is novel and could help the Transformer better understand the alignment between text and image through semantic image tokens. 2. Section 3 gives a theoretical analysis of how BPE tokenization benefits Transformer learning. 3. The scaling behavior of BPE is reflected in the model improving when larger-scale data such as ShareGPT4 is added.
1. The experimental evidence is rather weak. First, it is far behind current MLLM SOTA on public benchmarks. For example, the best presented result for the proposed model is LLM+VQ+BPE with additional scaling (SFT), which achieves 60.6 on VQAv2, 44.0 on MMBench, and 48.2 on VizWiz, far behind similar-size 7B LLaMA-based MLLMs. 2. Second, the ablation is not sufficient to show the benefit of the BPE image tokenizer. Only one table compares LLM+VQ and LLM+VQ+BPE. The details of these tw
1. This paper creatively adapts byte-pair encoding (BPE) for images, aiming to make visual data work more seamlessly with text in multimodal models. 2. The approach integrates structural information directly into image tokens, which could help models better understand and align visuals with text, showing solid potential in cross-modal tasks.
1. The theoretical framework has several notable limitations: 1.1 Lack of Multimodal Fusion Analysis: The paper’s theoretical analysis is focused on 2D image data alone and does not delve into how the BPE tokenizer facilitates the fusion of visual and textual information. Multimodal tasks typically require deep semantic and structural alignment across modalities, which is not sufficiently addressed in this analysis. This omission limits the theoretical support for the tokenizer’s efficacy in a multi
Taxonomy
Topics: Handwritten Text Recognition Techniques · Digital Humanities and Scholarship · Natural Language Processing Techniques
Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings
