Law of Vision Representation in MLLMs
Shijia Yang, Bohan Zhai, Quanzeng You, Jianbo Yuan, Hongxia Yang, Chenfeng Xu

TL;DR
This paper introduces the 'Law of Vision Representation' for multimodal large language models (MLLMs): cross-modal alignment and correspondence in the vision representation, taken together, correlate strongly with model performance, which makes choosing a vision representation far cheaper.
Contribution
It uncovers a linear relationship between the cross-modal Alignment and Correspondence (AC) score and model performance, so the optimal vision representation can be identified without finetuning the language model for every candidate.
Findings
AC score linearly correlates with model performance
Identifying the optimal vision representation through the AC score reduces computational cost by 99.7%
Validated on eight benchmarks across thirteen vision representation settings
Abstract
We present the "Law of Vision Representation" in multimodal large language models (MLLMs). It reveals a strong correlation between the combination of cross-modal alignment, correspondence in vision representation, and MLLM performance. We quantify the two factors using the cross-modal Alignment and Correspondence score (AC score). Through extensive experiments involving thirteen different vision representation settings and evaluations across eight benchmarks, we find that the AC score is linearly correlated to model performance. By leveraging this relationship, we are able to identify and train the optimal vision representation only, which does not require finetuning the language model every time, resulting in a 99.7% reduction in computational cost.
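The selection idea in the abstract can be sketched as follows: finetune on only a few vision representation settings, fit a linear model from their alignment (A) and correspondence (C) scores to benchmark performance, then rank all settings by predicted performance. This is a minimal sketch under stated assumptions, not the authors' implementation: the feature map (intercept, A, C), the number of sampled settings (4), and all scores are hypothetical and synthetic.

```python
import numpy as np

# Hypothetical setup: 13 vision representation settings with synthetic
# A (alignment) and C (correspondence) scores. These numbers are made up.
rng = np.random.default_rng(0)
n_settings = 13
a = rng.uniform(0.2, 0.9, size=n_settings)
c = rng.uniform(0.2, 0.9, size=n_settings)

# Synthetic "ground truth" benchmark performance, linear in A and C by
# construction. In practice this is unknown without finetuning each setting.
perf = 20.0 + 30.0 * a + 40.0 * c


def fit_and_rank(a, c, perf_sampled, sampled_idx):
    """Fit least squares on the sampled settings, then predict all settings.

    The feature map [1, A, C] is an assumption made for this sketch.
    """
    X = np.column_stack([np.ones(len(a)), a, c])
    coef, *_ = np.linalg.lstsq(X[sampled_idx], perf_sampled, rcond=None)
    pred = X @ coef
    return coef, pred, int(np.argmax(pred))


# Pretend only 4 settings were actually finetuned and evaluated.
sampled_idx = rng.choice(n_settings, size=4, replace=False)
coef, pred, best = fit_and_rank(a, c, perf[sampled_idx], sampled_idx)

print("sampled settings:", sorted(sampled_idx.tolist()))
print("predicted best setting:", best)
```

Because the synthetic performance is exactly linear in A and C, the fit recovers it from the four sampled settings. With real benchmark scores the relationship is noisy, so the number of finetuning runs needed before the prediction stabilizes would have to be determined empirically.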
Peer Reviews
Decision · Submitted to ICLR 2025
1. The paper identifies the limitations in current methods for choosing vision representations, highlighting the need for a deeper understanding rather than an empirical, trial-based approach. 2. The proposal of the Law of Vision Representation is an innovative concept that attempts to quantify the relationship between cross-modal alignment, correspondence in vision representation, and MLLM performance through the AC score. 3. The paper highlights an approach that achieves computational efficie
While this work raises an important question in MLLM, there are certain weaknesses that prevent me from assigning a higher score. 1. The experiments in this paper are conducted on a specific MLLM setup, with a particular LLM and projector configuration, which raises doubts about whether the conclusions can be applied if the LLM or projector settings are changed. Given that the paper aims to establish a "law of vision representation," the current experiments seem too limited in scope to substanti
1. The experiments are well-conducted and quite comprehensive. 2. The analyses of A, C, and AC scores, as well as their correlation with model performances are quite clear and valuable, highlighting the effectiveness of the proposed AC score.
1. How should we choose the number of $k'$ for Policy fitting? In Table 1, each benchmark has its own number of finetuning runs required to successfully predict the optimal vision representation. While 3 is the most common number across the benchmarks, there are still some outliers. It is difficult to determine the exact number of $k'$ for a benchmark without having the groundtruth. This suggests that the AC policy may not be very practical. 2. All the experiments are conducted under the conditi
1. The AC policy consistently predicts the optimal vision representation with minimal resources. 2. Provides insight into the choice and design of the vision encoder for LVLMs.
1. The calculation of the A score by using the image feature of CLIP might not fully fulfill the motivation for quantifying the cross-modal alignment, as it assumes that the image-text features are perfectly aligned in CLIP.
Code & Models
- 🤗shijiay/llava_clip224_stage1model
- 🤗shijiay/llava_clip224_stage2model · 1 dl
- 🤗shijiay/llava_clip_stage1model
- 🤗shijiay/llava_clip_stage2model · 3 dl
- 🤗shijiay/llava_openclip_stage1model · 2 dl
- 🤗shijiay/llava_openclip_stage2model · 2 dl
- 🤗shijiay/llava_dinov2_stage1model · 7 dl · ♡ 2
- 🤗shijiay/llava_dinov2_stage2model · 7 dl · ♡ 1
- 🤗shijiay/llava_sdim_stage1model · 1 dl
- 🤗shijiay/llava_sdim_stage2model
Taxonomy
Topics: Constraint Satisfaction and Optimization · Semantic Web and Ontologies
