Unified Pix Token And Word Token Generative Language Model

Haun Leung; ZiNan Wang

arXiv:2605.14028 · cs.CV · May 15, 2026

TL;DR

This paper proposes a generative language model that handles pixel tokens and word tokens in a single framework, aiming to improve visual understanding of fine details such as small text and numbers. It reports promising results from image unsupervised pretraining.

Contribution

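The model unifies pixel tokens and word tokens in one generative language model. Each pixel has its own token embedding, and the design adds color folding, a global conditional attention approximation, and image unsupervised pretraining, with the goal of better recognition of fine visual detail.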

Findings

01. Good performance in small models with limited data

02. Performance likely improves with larger models and more data

03. Addresses limitations in visual understanding for small text and numbers

Abstract

Since the emergence of the Vision Transformer (ViT), it has been widely used in generative language models and generative visual models. In particular, in current state-of-the-art open-source multimodal models, a ViT trained with the CLIP or SigLIP method serves as the vision encoder backbone that gives them visual understanding capabilities. However, this approach limits visual understanding of fine details, for example by making it difficult to recognize small text or numbers in images. To address these issues, we propose a new model that unifies pix tokens and word tokens in a generative language model. The new model also features a separate token embedding for each pix of the image, color folding, global conditional attention approximation, and image unsupervised pretraining. We conducted image unsupervised pretraining experiments with the new model to explore its potential. The experimental results…
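As a rough illustration of what placing pix tokens and word tokens in one sequence could look like, the sketch below maps each pixel to a single token id that sits after the word vocabulary. It is not the authors' method: `WORD_VOCAB_SIZE`, `LEVELS`, and all function names are hypothetical, and the per-channel quantization is only a stand-in used to keep the pixel vocabulary small. It does not implement the paper's color folding, which the abstract does not specify.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # one 8-bit RGB pixel

WORD_VOCAB_SIZE = 32_000  # hypothetical size of the word-token vocabulary
LEVELS = 8                # hypothetical quantization levels per color channel


def pixel_to_token(pixel: Pixel) -> int:
    """Map one RGB pixel to one token id placed after the word vocabulary."""
    # Quantize each 0..255 channel value into 0..LEVELS-1 (illustrative only).
    r, g, b = (channel * LEVELS // 256 for channel in pixel)
    # Combine the three quantized channels into a single index, then offset it
    # so pixel ids never collide with word ids.
    return WORD_VOCAB_SIZE + (r * LEVELS + g) * LEVELS + b


def image_to_tokens(image: Sequence[Sequence[Pixel]]) -> List[int]:
    """Flatten an image (rows of pixels) into pixel tokens in raster order."""
    return [pixel_to_token(pixel) for row in image for pixel in row]


def build_sequence(word_ids: Sequence[int],
                   image: Sequence[Sequence[Pixel]]) -> List[int]:
    """Concatenate word tokens and pixel tokens into one model input."""
    return list(word_ids) + image_to_tokens(image)


if __name__ == "__main__":
    tiny_image = [[(0, 0, 0), (255, 255, 255)]]  # a 1x2 image
    print(build_sequence([5, 17], tiny_image))
```

Under these assumptions the shared vocabulary has `WORD_VOCAB_SIZE + LEVELS ** 3` entries, so a single embedding table could serve both token types. Whether the paper uses a shared table or this kind of id layout is not stated in the abstract.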
