Law of Vision Representation in MLLMs

Shijia Yang; Bohan Zhai; Quanzeng You; Jianbo Yuan; Hongxia Yang; Chenfeng Xu

arXiv:2408.16357 · cs.CV · October 7, 2025

TL;DR

This paper introduces the "Law of Vision Representation" in multimodal large language models (MLLMs): the combination of cross-modal alignment and correspondence in the vision representation correlates strongly with model performance, which makes it possible to select a vision representation at low cost.

Contribution

It uncovers a linear relationship between the cross-modal Alignment and Correspondence score (AC score) and model performance, so the optimal vision representation can be identified and trained without finetuning the language model for every candidate.

Findings

01

AC score linearly correlates with model performance

02

Identifying the optimal vision representation via the AC score reduces computational cost by 99.7%

03

Method validated across multiple benchmarks and settings

Abstract

We present the "Law of Vision Representation" in multimodal large language models (MLLMs). It reveals a strong correlation between the combination of cross-modal alignment, correspondence in vision representation, and MLLM performance. We quantify the two factors using the cross-modal Alignment and Correspondence score (AC score). Through extensive experiments involving thirteen different vision representation settings and evaluations across eight benchmarks, we find that the AC score is linearly correlated to model performance. By leveraging this relationship, we are able to identify and train the optimal vision representation only, which does not require finetuning the language model every time, resulting in a 99.7% reduction in computational cost.
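The selection procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature set (A and C scores entered as separate linear terms), the helper names `fit_ac_policy` and `predict_scores`, and every number below are assumptions, and the benchmark scores are synthetic.

```python
import numpy as np

def fit_ac_policy(features, scores):
    """Least-squares fit of benchmark score on AC features, with an intercept."""
    X = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
    return coef

def predict_scores(coef, features):
    """Predict benchmark scores for candidates that were never finetuned."""
    X = np.column_stack([np.ones(len(features)), features])
    return X @ coef

# Hypothetical A and C scores for six candidate vision representations.
A = np.array([0.30, 0.45, 0.50, 0.62, 0.70, 0.85])
C = np.array([0.20, 0.35, 0.55, 0.40, 0.65, 0.80])
features = np.column_stack([A, C])

# Synthetic "ground truth" used only to simulate finetuning results.
true_scores = 20.0 + 30.0 * A + 40.0 * C

# Pretend only four candidates were finetuned and evaluated.
finetuned = [0, 1, 3, 4]
coef = fit_ac_policy(features[finetuned], true_scores[finetuned])

# Rank all candidates by predicted score and pick the best one.
predicted = predict_scores(coef, features)
best = int(np.argmax(predicted))
print("predicted best candidate:", best)
```

The point of the sketch is the workflow: only a few candidates are finetuned, the fit is extrapolated to the rest, and only the top-ranked representation would then be trained in full.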

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

1. The paper identifies the limitations in current methods for choosing vision representations, highlighting the need for a deeper understanding rather than an empirical, trial-based approach.
2. The proposal of the Law of Vision Representation is an innovative concept that attempts to quantify the relationship between cross-modal alignment, correspondence in vision representation, and MLLM performance through the AC score.
3. The paper highlights an approach that achieves computational efficie

Weaknesses

While this work raises an important question for MLLMs, certain weaknesses prevent me from assigning a higher score.
1. The experiments in this paper are conducted on a specific MLLM setup, with a particular LLM and projector configuration, which raises doubts about whether the conclusions still hold if the LLM or projector settings are changed. Given that the paper aims to establish a "law of vision representation," the current experiments seem too limited in scope to substanti

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The experiments are well conducted and quite comprehensive.
2. The analyses of the A, C, and AC scores, as well as their correlation with model performance, are clear and valuable, highlighting the effectiveness of the proposed AC score.

Weaknesses

1. How should $k'$ be chosen for policy fitting? In Table 1, each benchmark requires its own number of finetuning runs to successfully predict the optimal vision representation. Although 3 is the most common value across the benchmarks, there are still some outliers. Without the ground truth, it is difficult to determine the right $k'$ for a given benchmark, which suggests that the AC policy may not be very practical.
2. All the experiments are conducted under the conditi

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The AC policy consistently predicts the optimal vision representation with minimal resources.
2. The paper provides insight into the choice and design of the vision encoder for LVLMs.

Weaknesses

1. The calculation of the A score by using the image feature of CLIP might not fully fulfill the motivation for quantifying the cross-modal alignment, as it assumes that the image-text features are perfectly aligned in CLIP.
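Reviewer 03's concern can be made concrete with a small sketch. It assumes, purely for illustration, that an alignment score is the mean cosine similarity between a candidate's features and reference CLIP image features of the same dimensionality; the paper's exact A score formula is not reproduced on this page, and the function name `mean_cosine_similarity` and the random arrays standing in for real features are hypothetical.

```python
import numpy as np

def mean_cosine_similarity(feats, ref_feats):
    """Mean cosine similarity between row-paired feature vectors."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    ref = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    return float(np.mean(np.sum(feats * ref, axis=1)))

rng = np.random.default_rng(0)

# Random stand-ins for features of 8 images, 16 dimensions each.
clip_feats = rng.normal(size=(8, 16))                       # reference (CLIP)
other_feats = clip_feats + 0.5 * rng.normal(size=(8, 16))   # another encoder

# The reference scores itself perfectly, whatever its true image-text alignment.
a_clip = mean_cosine_similarity(clip_feats, clip_feats)
a_other = mean_cosine_similarity(other_feats, clip_feats)
print(a_clip, a_other)
```

Under this assumed definition the score measures closeness to CLIP rather than alignment with text directly, which is the gap the reviewer points out.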

Code & Models

Repositories

bronyayang/law_of_vision_representation_in_mllms
jax · Official

Taxonomy

Topics: Constraint Satisfaction and Optimization · Semantic Web and Ontologies