Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training
Yangyi Chen, Hao Peng, Tong Zhang, Heng Ji

TL;DR
PRIOR is a vision-language pre-training method that up-weights image-related caption tokens in the next-token prediction loss. This reduces fitting to noise and the risk of hallucination, and yields a 19% average relative improvement on benchmarks.
Contribution
The paper introduces a token re-weighting approach, based on importance sampling, that uses a text-only reference model to prioritize image-related tokens during vision-language pre-training.
Findings
19% average relative improvement on benchmarks
8% average relative improvement without visual encoders
Higher scaling coefficients indicating better potential for future gains
Abstract
In standard pre-training of large vision-language models (LVLMs), the model typically maximizes the joint probability of the caption conditioned on the image via next-token prediction (NTP). However, since only a small subset of caption tokens directly relates to the visual content, this naive NTP unintentionally fits the model to noise and increases the risk of hallucination. We present PRIOR, a simple vision-language pre-training approach that addresses this issue by prioritizing image-related tokens through differential weighting in the NTP loss, drawing from the importance sampling framework. PRIOR introduces a reference model, a text-only large language model (LLM) trained on the captions without image inputs, to weight each token based on its probability for LVLM training. Intuitively, tokens that are directly related to the visual inputs are harder to predict without the image and…
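The weighting idea in the abstract can be illustrated with a minimal sketch. The abstract is truncated before it gives the exact weighting function, so the form used here, a weight proportional to (1 - p_ref)^alpha and normalized to mean 1, is an assumption made for illustration and not necessarily the paper's formula. The names `reference_weights`, `weighted_ntp_loss`, and `alpha` are hypothetical.

```python
import math


def reference_weights(ref_probs, alpha=1.0):
    """Map text-only reference-model probabilities to per-token weights.

    ASSUMPTION: weight is proportional to (1 - p_ref) ** alpha, so tokens
    the text-only model predicts poorly (likely image-related) count more.
    Weights are normalized to have mean 1.
    """
    raw = [(1.0 - p) ** alpha for p in ref_probs]
    mean = sum(raw) / len(raw)
    if mean == 0.0:
        # Every token is trivially predictable from text; fall back to uniform.
        return [1.0] * len(raw)
    return [r / mean for r in raw]


def weighted_ntp_loss(lvlm_probs, ref_probs, alpha=1.0):
    """Weighted next-token prediction loss over one caption.

    lvlm_probs: probability the LVLM (with the image) assigns to each
                ground-truth caption token.
    ref_probs:  probability the text-only reference model assigns to the
                same tokens.
    """
    if len(lvlm_probs) != len(ref_probs):
        raise ValueError("lvlm_probs and ref_probs must have the same length")
    weights = reference_weights(ref_probs, alpha)
    nll = [-math.log(p) for p in lvlm_probs]
    return sum(w * l for w, l in zip(weights, nll)) / len(nll)


if __name__ == "__main__":
    # Made-up probabilities for the caption "a red car on the street".
    tokens = ["a", "red", "car", "on", "the", "street"]
    ref_probs = [0.90, 0.05, 0.10, 0.80, 0.95, 0.30]
    lvlm_probs = [0.92, 0.40, 0.55, 0.85, 0.96, 0.50]
    for tok, w in zip(tokens, reference_weights(ref_probs)):
        print(f"{tok:>7}: weight {w:.2f}")
    print("weighted loss:", weighted_ntp_loss(lvlm_probs, ref_probs))
```

With `alpha=0.0` every weight equals 1 and the function reduces to the plain mean NTP loss, which makes the sketch easy to compare against the unweighted baseline.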
Taxonomy
Topics: Second Language Acquisition and Learning
