Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training

Yangyi Chen; Hao Peng; Tong Zhang; Heng Ji

arXiv:2505.08971·cs.CV·May 15, 2025


TL;DR

PRIOR is a vision-language pre-training method that up-weights image-related caption tokens in the training loss, which reduces fitting to noise and the risk of hallucination and yields a reported 19% average relative improvement on benchmarks.

Contribution

The paper introduces a token re-weighting approach, based on importance sampling, that uses a text-only reference model to weight tokens in the vision-language pre-training loss.

Findings

01. 19% average relative improvement on benchmarks
02. 8% average relative improvement without visual encoders
03. Higher scaling coefficients, indicating better potential for future gains

Abstract

In standard large vision-language model (LVLM) pre-training, the model typically maximizes the joint probability of the caption conditioned on the image via next-token prediction (NTP). However, since only a small subset of caption tokens directly relates to the visual content, this naive NTP unintentionally fits the model to noise and increases the risk of hallucination. We present PRIOR, a simple vision-language pre-training approach that addresses this issue by prioritizing image-related tokens through differential weighting in the NTP loss, drawing from the importance sampling framework. PRIOR introduces a reference model, a text-only large language model (LLM) trained on the captions without image inputs, to weight each token based on its probability for LVLM training. Intuitively, tokens that are directly related to the visual inputs are harder to predict without the image and…
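The re-weighting idea in the abstract can be illustrated with a minimal sketch. The abstract is truncated here and does not give the exact weighting function, so the form below, a weight proportional to (1 - p_ref)^alpha and normalized to mean 1, is an assumption chosen only to match the stated intuition that tokens the text-only reference model finds hard to predict should count more. The function names, the `alpha` parameter, and the example probabilities are all hypothetical and are not taken from the paper or its repository.

```python
import math


def prior_style_weights(ref_probs, alpha=1.0):
    """Per-token weights from reference-model probabilities.

    ref_probs[i] is the probability the text-only reference model assigns
    to the i-th ground-truth caption token (no image input).
    ASSUMPTION: the weight form (1 - p)^alpha is illustrative only.
    """
    raw = [(1.0 - p) ** alpha for p in ref_probs]
    total = sum(raw)
    if total == 0.0:
        # Every token is perfectly predictable from text: fall back to uniform.
        return [1.0] * len(raw)
    scale = len(raw) / total  # normalize so the weights average to 1
    return [r * scale for r in raw]


def weighted_ntp_loss(lvlm_probs, ref_probs, alpha=1.0):
    """Weighted next-token-prediction loss over one caption.

    lvlm_probs[i] is the probability the vision-language model assigns to
    the i-th ground-truth token given the image and the preceding tokens.
    All probabilities must be strictly positive.
    """
    weights = prior_style_weights(ref_probs, alpha)
    nll = [-math.log(p) for p in lvlm_probs]
    return sum(w * l for w, l in zip(weights, nll)) / len(nll)


# Hypothetical caption: "a red kite in the sky". The probabilities are made up.
tokens = ["a", "red", "kite", "in", "the", "sky"]
ref_probs = [0.90, 0.05, 0.02, 0.80, 0.90, 0.30]   # text-only reference model
lvlm_probs = [0.70, 0.40, 0.30, 0.90, 0.95, 0.60]  # model being trained

for tok, w in zip(tokens, prior_style_weights(ref_probs)):
    print(f"{tok:>5}: weight {w:.2f}")
print("weighted loss:", weighted_ntp_loss(lvlm_probs, ref_probs))
```

In this sketch, "red" and "kite" receive the largest weights because the reference model cannot guess them from text alone, while function words such as "a" and "the" are down-weighted. Setting `alpha=0.0` recovers the plain, unweighted NTP loss.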


Code & Models

Repositories

yangyi-chen/prior (Official)


Taxonomy

Topics: Second Language Acquisition and Learning