Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment

Yang Hu; Runchen Wang; Stephen Chong Zhao; Xuhui Zhan; Do Hun Kim; Mark Wallace; David A. Tovar

arXiv:2505.14204·cs.CV·May 21, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces Perceptual-Initialization, a novel approach that incorporates human perceptual data during the initial training phase of vision encoders, leading to significant zero-shot performance improvements across multiple benchmarks without fine-tuning.

Contribution

It presents a new paradigm of using human perceptual structure at initialization, enhancing vision-language models' generalization and alignment without task-specific fine-tuning.

Findings

01. Significant zero-shot performance gains on 29 classification and 2 retrieval benchmarks.

02. Improvements in top-1 and top-5 accuracy, and retrieval recall metrics.

03. Effective across datasets of various scales and characteristics.
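The metrics named in the findings can be made concrete with a small sketch. The helper below is hypothetical (its name and input layout are assumptions, not taken from the paper): it computes top-k accuracy from a table of per-class similarity scores. Retrieval recall@k follows the same pattern when each query has exactly one matching item, with that item's index used as the label.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: list of rows, one per sample, each a list of per-class scores
            (for example, image-text similarities in a zero-shot setup).
    labels: list of true class indices, one per sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Rank class indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy scores for three samples over three classes (illustrative values only).
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
labels = [1, 1, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

With these toy values only the first sample is correct at k=1, and the second becomes correct at k=2, which is why top-5 accuracy is always at least as high as top-1.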

Abstract

We introduce Perceptual-Initialization (PI), a paradigm shift in visual representation learning that incorporates human perceptual structure during the initialization phase rather than as a downstream fine-tuning step. By integrating human-derived triplet embeddings from the NIGHTS dataset to initialize a CLIP vision encoder, followed by self-supervised learning on YFCC15M, our approach demonstrates significant zero-shot performance improvements, without any task-specific fine-tuning, across 29 zero-shot classification and 2 retrieval benchmarks. On ImageNet-1K, zero-shot gains emerge after approximately 15 epochs of pretraining. Benefits are observed across datasets of various scales, with improvements manifesting at different stages of the pretraining process depending on dataset characteristics. Our approach consistently enhances zero-shot top-1 accuracy, top-5 accuracy, and…
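To illustrate what training on human triplet judgments can look like, here is a minimal sketch of a hinge loss over a two-alternative triplet: a reference image and two candidates, with a human label saying which candidate looks closer to the reference. This is an assumption for illustration only. The abstract does not state the exact objective, and the function names, the cosine distance, and the `margin` value are all hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_hinge_loss(ref, img_a, img_b, human_choice, margin=0.05):
    """Hinge loss on one human-labelled triplet (illustrative, not the paper's loss).

    human_choice: 0 if humans judged img_a closer to ref, 1 if img_b.
    The loss is zero once the embedding distances agree with the human
    choice by at least `margin` (an assumed hyperparameter).
    """
    d_a = cosine_distance(ref, img_a)
    d_b = cosine_distance(ref, img_b)
    # gap > 0 means the model ranks the human-preferred image as closer.
    gap = (d_b - d_a) if human_choice == 0 else (d_a - d_b)
    return max(0.0, margin - gap)


# Toy 2-D embeddings: img_a points almost the same way as ref, img_b is orthogonal.
ref, img_a, img_b = [1.0, 0.0], [1.0, 0.1], [0.0, 1.0]
loss_agree = triplet_hinge_loss(ref, img_a, img_b, human_choice=0)
loss_disagree = triplet_hinge_loss(ref, img_a, img_b, human_choice=1)
```

In this toy case the loss vanishes when the human label matches the embedding geometry and is large when it contradicts it. In a PI-style setup, minimizing such a loss would happen before the large-scale contrastive pretraining rather than after it.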

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- The proposed method is novel and simple yet effective. The promising results can encourage follow-up research exploring other initialization strategies.
- Results on zero-shot image classification and retrieval tasks demonstrate that PI scales as the data volume increases, indicating the method's potential in large-scale training.
- The paper is well organized and nicely presented. The ending section points out remaining challenges faithfully and offers valuable insights, strength

Weaknesses

- The proposed method limits its scope to the initialization of CLIP-type models, even though human preference alignment is independent of the text encoder. The authors could add experiments on other visual backbones, such as vanilla ViTs, to fully explore the potential of the method.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

**Originality**: Leveraging supervised human behavioural data as a foundational inductive bias in model initialization is a novel idea that opens a new research direction. The work provides a structured solution that converts the often-ignored variance of random initialization into a principled prior.

**Significance**: The PI paradigm is the core strength of the paper. It uses supervised human perceptual data to initialize VLM parameters prior to large-scale pretraining, provide a pot

Weaknesses

**Limited scope of the prior**: Only the vision encoder is initialized with PI; the text encoder is still randomly initialized and trained from scratch. What is the reason for this choice in the experiments? CLIP-like models operate on a shared latent space of the vision and text modalities. The paper could be strengthened by exploring a complementary initialization of the text encoder, to see whether a model with full PI initialization provides synergistic benefits.

**Perceptual Loss**: The core

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. Novel use of human perceptual priors as initialization rather than alignment fine-tuning.
2. Comprehensive evaluation over diverse datasets shows consistent positive gains.
3. Very low additional compute cost.
4. Clear comparison showing that late perceptual fine-tuning disrupts alignment, which opens a new direction for human- or brain-aligned pretraining.

Weaknesses

1. No experiments using random or pseudo-perceptual triplets to isolate the contribution of human perceptual structure.
2. The approach is validated only on NIGHTS; applicability to richer datasets remains untested.
3. No probing or visualization is provided to show how perceptual initialization changes the internal feature space or similarity structure compared to the baseline.
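The first weakness asks for a control with random or pseudo-perceptual triplets. One simple way to build such a control, sketched here under assumed names and an assumed record layout (neither comes from the paper or the NIGHTS release), is to keep every triplet's images and permute the human choice labels across triplets. The label distribution is preserved while the link between images and human judgments is destroyed.

```python
import random


def shuffle_human_choices(triplets, seed=0):
    """Return copies of the triplets with the human choices permuted across them.

    triplets: list of dicts with a "choice" key (0 or 1) plus any image fields.
    The input list and its dicts are left unmodified.
    """
    rng = random.Random(seed)  # fixed seed so the control set is reproducible
    choices = [t["choice"] for t in triplets]
    rng.shuffle(choices)
    return [dict(t, choice=c) for t, c in zip(triplets, choices)]


# Hypothetical records: "id" stands in for the reference and candidate images.
triplets = [{"id": i, "choice": i % 2} for i in range(10)]
control = shuffle_human_choices(triplets, seed=1)
```

If initializing on the shuffled set produced the same downstream gains as the real labels, the benefit could not be attributed to human perceptual structure.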

Taxonomy

Topics: Categorization, perception, and language · Language, Metaphor, and Cognition · Speech and dialogue systems

Methods: Contrastive Language-Image Pre-training