Unified Pix Token And Word Token Generative Language Model
Haun Leung, ZiNan Wang

TL;DR
This paper introduces a unified model combining pixel and word tokens for generative language tasks, enhancing visual understanding especially for small details, with promising results in unsupervised pretraining.
Contribution
The model unifies pixel and word tokens in a single generative language model, adding per-pixel token embeddings, color folding, and global conditional attention approximation to improve recognition of fine visual detail.
Findings
Good performance in small models with limited data
Performance likely improves with larger models and more data
Addresses limitations in visual understanding for small text and numbers
Abstract
Since its emergence, the Vision Transformer (ViT) has been widely used in generative language models and generative visual models. In particular, in current state-of-the-art open-source multimodal models, a ViT trained with the CLIP or SigLIP method serves as the vision encoder backbone that gives them visual understanding capabilities. However, this approach limits visual understanding of fine details, for example making it difficult to recognize small text or numbers in images. To address these issues, we propose a new model that unifies pix tokens and word tokens in a generative language model. The new model also gives each pix of the image its own token embedding and features color folding, global conditional attention approximation, and unsupervised image pretraining. We conducted unsupervised image pretraining experiments with the new model to explore its potential. The experimental results…
