Intriguing Properties of Large Language and Vision Models
Young-Jun Lee, Byungsoo Ko, Han-Gyu Kim, Yechan Hwang, Ho-Jin Choi

TL;DR
This paper systematically investigates how large language and vision models (LLVMs) perceive images. It reports that they process images globally, can solve some math problems without detailed numerical perception, overfit cross-modal alignment to complex reasoning, and depend on early-layer representations. The findings are intended to guide future improvements.
Contribution
It provides a comprehensive analysis of LLVMs' perception and reasoning properties, highlighting their strengths and limitations, and suggests directions for future research and benchmark development.
Findings
LLVMs process images globally, regardless of patch order.
They can solve some math problems without detailed numerical perception.
Cross-modal alignment overfits to complex reasoning tasks, at the cost of some perceptual capability.
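The patch-order finding can be illustrated with a minimal sketch of a shuffling probe. This is not the authors' code: the function names, array sizes, and the choice to shuffle patch tokens (rather than image pixels) are assumptions made for illustration. The idea is to feed the model the same set of patch tokens in a random order and compare task accuracy with and without the shuffle.

```python
import numpy as np


def shuffle_patch_tokens(tokens, rng):
    """Return the patch tokens in a random order, plus the permutation used.

    tokens: array of shape (num_patches, dim), one row per image patch.
    """
    order = rng.permutation(tokens.shape[0])
    return tokens[order], order


def accuracy_drop(acc_original, acc_shuffled):
    """Absolute accuracy change caused by shuffling (positive = worse)."""
    return acc_original - acc_shuffled


rng = np.random.default_rng(0)
# Hypothetical sizes: a 24 x 24 patch grid with 64-dimensional features.
tokens = rng.normal(size=(24 * 24, 64))
shuffled, order = shuffle_patch_tokens(tokens, rng)

# Same set of tokens, different sequence order.
assert shuffled.shape == tokens.shape
assert np.array_equal(shuffled[np.argsort(order)], tokens)
```

A small `accuracy_drop` on a benchmark would be consistent with order-insensitive processing on that benchmark. It would not by itself establish the property in general.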
Abstract
Recently, large language and vision models (LLVMs) have received significant attention and development efforts due to their remarkable generalization performance across a wide range of tasks requiring perception and cognitive abilities. A key factor behind their success is their simple architecture, which consists of a vision encoder, a projector, and a large language model (LLM). Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks (e.g., MMVP) remains surprisingly low. This discrepancy raises the question of how LLVMs truly perceive images and exploit the advantages of the vision encoder. To address this, we systematically investigate this question regarding several aspects: permutation invariance, robustness, math reasoning, alignment preserving and importance, by evaluating the most common LLVM families (i.e., LLaVA)…
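The three-part architecture named in the abstract (vision encoder, projector, LLM) can be sketched as a data flow. This is a toy illustration under stated assumptions, not the LLaVA implementation: the projector is assumed to be a single linear layer, the visual tokens are assumed to be placed before the text tokens, and all sizes and random arrays are hypothetical stand-ins for real model outputs.

```python
import numpy as np


def build_llm_input(patch_features, projector_weight, text_embeddings):
    """Project vision features into the LLM embedding space and prepend them.

    patch_features:   (num_patches, vision_dim) output of the vision encoder
    projector_weight: (vision_dim, llm_dim) a single linear layer (assumption)
    text_embeddings:  (num_text_tokens, llm_dim) embedded prompt tokens
    """
    visual_tokens = patch_features @ projector_weight
    return np.concatenate([visual_tokens, text_embeddings], axis=0)


rng = np.random.default_rng(0)
patch_features = rng.normal(size=(16, 32))    # stand-in for encoder output
projector_weight = rng.normal(size=(32, 48))  # stand-in for trained weights
text_embeddings = rng.normal(size=(5, 48))    # stand-in for prompt embeddings

llm_input = build_llm_input(patch_features, projector_weight, text_embeddings)
print(llm_input.shape)  # (21, 48)
```

The LLM then attends over the combined sequence, so the projector is the only component that maps visual features into the language model's embedding space.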
Peer Reviews
Decision: Submitted to ICLR 2025
This paper is useful because it digs into how large language and vision models (LLVMs) really work with visual information. It shows that LLVMs are flexible, able to handle scrambled image pieces and solve math problems without all the visual details. It also highlights that when LLVMs are tuned for complex reasoning, they lose some basic visual skills. These findings can help make LLVMs better by balancing complex reasoning with simpler perception tasks, potentially guiding the creation of new,
The paper would benefit from more precise and carefully scoped conclusions. While the experiments provide interesting observations, the claims drawn from them are often overly broad. For example: 1) The claim about permutation invariance is based on VQA tasks, but this alone cannot support a general conclusion about LLVMs' visual processing capabilities. 2) The benchmarks used don't adequately test basic visual skills to support such sweeping statements about visual understanding. Specificity Ne
* The paper addresses an important and timely problem that is of interest to the VLM community.
* Some observations made by the authors are insightful and have the potential to contribute to the understanding of VLMs.
* The set of analyses is diverse and has the potential to inform the development of new architectures and benchmarks.
* The Appendix gives a detailed set of benchmark datasets.
* Many conclusions are not clearly backed up by evidence or rigorous analysis, which undermines their validity (I've given some examples in the questions section).
* Some observations appear to be cherry-picked (by dataset or sample?), which raises concerns about the representativeness and generalizability of the results. For instance, the authors want to evaluate the LLaVA family of models but some results are given on a single model and data sample while the conclusion is general (Figure 1 and
The paper comprehensively evaluates several VLMs across various perception and reasoning benchmarks to better understand their behavior and performance. These empirical findings could provide valuable insights for model development.
1. While the paper shows some good empirical studies of LLVM performance, they are only loosely connected to the suggested future directions. This makes the contribution of the paper less clear. For example, the authors suggest "deeply consider innovative model architectures" in Sec 4 for enhancing cross-modal alignment, yet it is not clear how this relates to the empirical findings discussed in the previous sections or which enhancements follow from their analysis.
2. The author
Taxonomy
Topics: Multimodal Machine Learning Applications · Geographic Information Systems Studies
Methods: Softmax · Attention Is All You Need
