Unified Multimodal Understanding via Byte-Pair Visual Encoding

Wanpeng Zhang; Yicheng Feng; Hao Luo; Yijiang Li; Zihao Yue; Sipeng Zheng; Zongqing Lu

arXiv:2506.23639·cs.CV·July 1, 2025

Open Access

TL;DR

This paper introduces a unified multimodal understanding framework that applies byte-pair encoding to visual tokens, enhancing cross-modal reasoning and performance in vision-language tasks.

Contribution

It proposes a novel byte-pair visual encoding method with priority-guided encoding and curriculum-driven training, unifying visual and textual representations in multimodal models.

Findings

01 Improved performance on diverse vision-language tasks

02 Better cross-modal relationship modeling

03 Enhanced visual token structural encoding

Abstract

Multimodal large language models (MLLMs) have made significant progress in vision-language understanding, yet effectively aligning different modalities remains a fundamental challenge. We present a framework that unifies multimodal understanding by applying byte-pair encoding to visual tokens. Unlike conventional approaches that rely on modality-specific encoders, our method directly incorporates structural information into visual tokens, mirroring successful tokenization strategies in text-only language models. We introduce a priority-guided encoding scheme that considers both frequency and spatial consistency, coupled with a multi-stage training procedure based on curriculum-driven data composition. These enhancements enable the transformer model to better capture cross-modal relationships and reason with visual information. Comprehensive experiments demonstrate improved performance…
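To make the idea of byte-pair encoding over visual tokens more concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes each image is already quantized into a grid of integer token ids, considers only horizontally adjacent pairs, and scores each pair as frequency plus a weighted consistency term. The consistency proxy (the share of a pair's occurrences that fall in its most common column), the weight `alpha`, and all function names are hypothetical choices made for illustration only; the abstract states just that the priority considers both frequency and spatial consistency.

```python
from collections import Counter, defaultdict


def pair_stats(grids):
    """Count horizontally adjacent token pairs and the columns they start in."""
    freq = Counter()
    cols = defaultdict(Counter)
    for grid in grids:
        for row in grid:
            for c in range(len(row) - 1):
                pair = (row[c], row[c + 1])
                freq[pair] += 1
                cols[pair][c] += 1
    return freq, cols


def priority(pair, freq, cols, alpha):
    """Hypothetical score: frequency plus alpha times a toy consistency proxy."""
    # Assumption: consistency = share of occurrences in the most common column.
    consistency = max(cols[pair].values()) / freq[pair]
    return freq[pair] + alpha * consistency


def merge_row(row, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `row` with `new_id`."""
    out, i = [], 0
    while i < len(row):
        if i + 1 < len(row) and (row[i], row[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(row[i])
            i += 1
    return out


def learn_merges(grids, num_merges, next_id, alpha=1.0):
    """Greedily learn merge rules, picking the highest-priority pair each round."""
    merges = []
    for _ in range(num_merges):
        freq, cols = pair_stats(grids)
        if not freq:
            break
        best = max(freq, key=lambda p: priority(p, freq, cols, alpha))
        merges.append((best, next_id))
        grids = [[merge_row(row, best, next_id) for row in grid] for grid in grids]
        next_id += 1
    return merges, grids


if __name__ == "__main__":
    # One toy 2x3 "image"; the pair (1, 2) appears twice at column 0.
    toy = [[[1, 2, 3], [1, 2, 4]]]
    merges, encoded = learn_merges(toy, num_merges=1, next_id=100)
    print(merges)   # [((1, 2), 100)]
    print(encoded)  # [[[100, 3], [100, 4]]]
```

With `alpha=0` the sketch reduces to plain frequency-based BPE, which is one way to see what the extra priority term changes. A fuller treatment would also need to handle vertical adjacency and keep the merged rows aligned as a 2D grid, both of which this sketch ignores.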

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques