Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
Yaxin Luo, Zhiqiang Shen

TL;DR
This paper demonstrates that a simple bridge training stage can effectively adapt large language models for vision tasks, challenging the belief that language and vision models are incompatible due to parameter differences.
Contribution
Introducing random label bridge training as a modality adaptation method that aligns language models with vision tasks without manual labeling.
Findings
Bridge training effectively aligns LLMs with vision tasks.
Partial bridge training retains beneficial properties of certain LLM layers.
No manual labeling required for the proposed adaptation method.
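The "no manual labeling" finding can be illustrated with a minimal sketch of how a random-label dataset might be built. The summary does not say how the labels are drawn, so the uniform sampling, the function name `assign_random_labels`, the class count, and the image identifiers below are all illustrative assumptions rather than the authors' procedure.

```python
import random


def assign_random_labels(num_samples, num_classes, seed=0):
    """Draw one label per sample uniformly at random, ignoring image content.

    Assumption: uniform sampling over classes; the paper may use another scheme.
    """
    rng = random.Random(seed)  # local generator so the labels are reproducible
    return [rng.randrange(num_classes) for _ in range(num_samples)]


# Hypothetical unlabeled image identifiers (placeholders, not real data).
image_ids = [f"img_{i:04d}" for i in range(8)]

# Pair every image with a random label; no human annotation is involved.
labels = assign_random_labels(len(image_ids), num_classes=4)
bridge_dataset = list(zip(image_ids, labels))
```

A bridge stage would then train on `bridge_dataset` as if the labels were real. The fixed seed keeps the image-to-label assignment stable across epochs.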
Abstract
The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality (language and vision) adaptation inherently more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks due to disparate parameter spaces. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution, random label bridge training, that requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is…
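The abstract opens by comparing the "ratio of outlier parameters" across models but, as excerpted here, does not define an outlier. As a rough sketch under one common convention (a weight counts as an outlier when it lies more than `k` standard deviations from the mean), the ratio could be computed as follows. The threshold `k = 3.0`, the function name `outlier_ratio`, and the toy weights are assumptions for illustration only.

```python
import statistics


def outlier_ratio(weights, k=3.0):
    """Fraction of weights farther than k standard deviations from the mean.

    Assumption: a k-sigma rule; the paper may define outliers differently.
    """
    mean = statistics.fmean(weights)
    std = statistics.pstdev(weights)  # population standard deviation
    if std == 0:
        return 0.0  # all weights are identical, so none is an outlier
    return sum(abs(w - mean) > k * std for w in weights) / len(weights)


# Toy weight vector: 99 small values and one large one (not real model weights).
toy_weights = [0.01] * 99 + [5.0]
ratio = outlier_ratio(toy_weights)
```

Applying the same function to the flattened weights of a language model and a vision model would give two numbers to compare, which is the kind of gap the abstract refers to.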
