Seeing Beyond Redundancy: Task Complexity's Role in Vision Token Specialization in VLLMs
Darryl Hannan, John Cooper, Dylan White, and Yijing Watkins

TL;DR
This paper investigates how task complexity influences visual information processing in vision large language models (VLLMs), revealing that higher-complexity data promotes better visual representations and task performance.
Contribution
It introduces a synthetic benchmark and metrics to analyze visual redundancy, and demonstrates the impact of training data complexity on VLLMs' visual feature utilization.
Findings
Higher task complexity correlates with increased visual compression.
Training on complex visual data improves VLLMs' performance on fine-grained tasks.
Visual redundancy decreases as models are exposed to more complex visual information.
Abstract
Vision capabilities in vision large language models (VLLMs) have consistently lagged behind their linguistic capabilities. In particular, numerous benchmark studies have demonstrated that VLLMs struggle when fine-grained visual information or spatial reasoning is required. However, we do not yet understand exactly why VLLMs struggle so much with these tasks relative to others. Some works have focused on visual redundancy as an explanation, where high-level visual information is uniformly spread across numerous tokens and specific, fine-grained visual information is discarded. In this work, we investigate this premise in greater detail, seeking to better understand exactly how various types of visual information are processed by the model and what types of visual information are discarded. To do so, we introduce a simple synthetic benchmark dataset that is specifically constructed to…
Peer Reviews
Decision·Submitted to ICLR 2026
- This paper proposes and comprehensively applies multiple quantitative indicators (such as Gini coefficient, stable rank, participation rate, etc.) to systematically analyze visual information redundancy at two levels, token norm and matrix rank, surpassing previous studies that focused only on attention distribution and providing a more comprehensive tool for understanding the internal visual information processing of VLLMs.
- The experiments precisely control variables through the c
- Some findings, such as "there is a connection between task complexity and visual compression", are similar to the conclusions given in previous works like PDrop [1].
- Fine-tuning experiments are based only on simplified subsets of COCO and GQA (such as objects with only "left-right" relationships), and more complex spatial relationships (such as spatial reasoning in ERQA) have not been tested, which may underestimate the model's redundant performance in real complex tasks.
- The experimen
+ The authors propose a very detailed statistical analysis to uncover redundancy in the token representations within an LLM. Many of the proposed techniques could likely be reused by other works interested in uncovering more about the hidden representations of these models.
+ Surprising finding that, in the experimental settings of the work, fine-tuning a model on visual data seems to overwhelmingly alter text representations while leaving vision representations largely unaltered.
+ Carefully cur
a. **Limited experimental analysis**: Most experiments in the paper are performed using only the Molmo MLLM; it would have been interesting to see the analysis expanded to other models trained on different data mixtures and with different architectures. The paper does consider llama, but only for the experiments on probes and visual ablations. Analysis of other decoder-based MLLMs besides Olmo would have made this submission stronger.
b. **Nice analysis, but limited applicability**: While th
1. Systematic metric design – The work proposes a comprehensive suite of metrics (both norm- and rank-based, plus SVD alignment) to analyze compression and redundancy in VLLMs’ hidden states, offering more granular insight than prior attention-based analyses.
2. Detailed layer-wise analysis – The visualization across layers for different metrics provides an interpretable picture of how visual information is redistributed within models.
3. Task complexity perspective – The link between downstre
1. Synthetic dataset reliance – The main analyses are conducted on a fully synthetic dataset designed by the authors, with limited validation of whether the findings generalize to real-world tasks. The COCO and GQA datasets used in the synthetic data experiments were also only analyzed using the metrics proposed in the paper, rather than through more intuitive computations of prediction accuracy.
2. Evaluation metric coverage vs accuracy gains – The paper heavily focuses on reporting compressio
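The norm- and rank-based indicators the reviewers name (Gini coefficient, stable rank, and what one review calls "participation rate", read here as the participation ratio) can be illustrated with a short NumPy sketch. This is not the authors' implementation: the function names, the choice to apply the Gini coefficient to per-token L2 norms, the uncentered participation ratio, and the random stand-in for a layer's hidden states are all assumptions made for illustration.

```python
import numpy as np


def gini(values):
    """Gini coefficient of non-negative values: 0 means evenly spread, values near 1 mean concentrated."""
    x = np.sort(np.asarray(values, dtype=float))  # ascending order
    n = x.size
    total = x.sum()
    if n == 0 or total == 0.0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float(2.0 * np.sum(ranks * x) / (n * total) - (n + 1) / n)


def stable_rank(hidden):
    """Stable rank: squared Frobenius norm divided by squared spectral norm."""
    s = np.linalg.svd(np.asarray(hidden, dtype=float), compute_uv=False)
    if s[0] == 0.0:
        return 0.0
    return float(np.sum(s**2) / s[0] ** 2)


def participation_ratio(hidden):
    """Participation ratio of the squared singular values (assumption: no mean-centering)."""
    s = np.linalg.svd(np.asarray(hidden, dtype=float), compute_uv=False)
    lam = s**2
    denom = np.sum(lam**2)
    if denom == 0.0:
        return 0.0
    return float(np.sum(lam) ** 2 / denom)


# Hypothetical stand-in for one layer's vision-token hidden states: (num_tokens, hidden_dim).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(16, 8))
token_norms = np.linalg.norm(hidden_states, axis=1)

print("Gini of token norms:", gini(token_norms))
print("Stable rank:", stable_rank(hidden_states))
print("Participation ratio:", participation_ratio(hidden_states))
```

Under these definitions, a higher Gini value means the norm mass sits in fewer tokens, while lower stable rank or participation ratio means the token matrix occupies fewer effective directions. How the paper maps these quantities onto "compression" and "redundancy" is not specified in the excerpt above.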
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
