TL;DR
This paper reveals that in multimodal large language models, multiple vision encoders often provide redundant information, and selectively masking encoders can improve performance and efficiency, challenging the assumption that more encoders always enhance results.
Contribution
The study introduces two metrics, CUR and IG, to quantify encoder utility and redundancy, providing new insights into encoder specialization and redundancy in multimodal models.
Findings
Encoders show high redundancy on general tasks, allowing for simpler models.
Masking selected encoders can improve task performance while reducing compute cost.
Some encoders actively hurt performance: accuracy sometimes rises when they are masked.
Abstract
Recent multimodal large language models (MLLMs) increasingly integrate multiple vision encoders to improve performance on various benchmarks, assuming that diverse pretraining objectives yield complementary visual signals. However, we show this assumption often fails in practice. Through systematic encoder masking across representative multi-encoder MLLMs, we find that performance typically degrades gracefully, and sometimes even improves, when selected encoders are masked, revealing pervasive encoder redundancy. To quantify this effect, we introduce two principled metrics: the Conditional Utilization Rate (CUR), which measures an encoder's marginal contribution in the presence of others, and the Information Gap (IG), which captures heterogeneity in encoder utility within a model. Using these tools, we observe: (i) strong specialization on tasks like OCR and Chart, where a single…
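The abstract excerpt describes CUR and IG only in words, so the following is a minimal sketch under stated assumptions, not the authors' definitions. It assumes CUR is the relative accuracy drop when one encoder is masked while the others stay active, and IG is the spread between the largest and smallest CUR in a model. The function names, encoder names, and accuracy values are hypothetical and chosen for illustration only.

```python
from typing import Dict


def conditional_utilization_rate(acc_full: float, acc_masked: float) -> float:
    """Assumed form of CUR: relative accuracy drop when one encoder is masked.

    Positive: the encoder contributes. Near zero: it is redundant.
    Negative: masking it helps, so the encoder is detrimental.
    """
    if acc_full <= 0:
        raise ValueError("acc_full must be positive")
    return (acc_full - acc_masked) / acc_full


def information_gap(curs: Dict[str, float]) -> float:
    """Assumed form of IG: gap between the most and least useful encoder."""
    values = list(curs.values())
    return max(values) - min(values)


# Hypothetical accuracies, for illustration only (not results from the paper).
acc_full = 0.80  # accuracy with every encoder active
acc_when_masked = {"enc_a": 0.40, "enc_b": 0.80, "enc_c": 0.84}

curs = {
    name: conditional_utilization_rate(acc_full, acc)
    for name, acc in acc_when_masked.items()
}
ig = information_gap(curs)

for name, cur in curs.items():
    print(f"{name}: CUR = {cur:+.3f}")
print(f"IG = {ig:.3f}")
```

Under these assumed definitions, an encoder with a large positive CUR dominates the task, one with CUR near zero is redundant, and one with negative CUR is a candidate for masking. A large IG would then indicate that utility is concentrated in a few encoders.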
Peer Reviews
Decision·ICLR 2026 Poster
1. The paper tackles an underexplored but highly relevant question in multimodal LLM design — encoder redundancy. While prior works focused on adding more encoders or improving fusion, this paper challenges the “more is better” assumption and provides a new analytical perspective. 2. The introduction of Conditional Utilization Rate (CUR) and Information Gap (IG) offers principled and interpretable metrics for quantifying encoder contribution and redundancy, enabling future researchers to diagnos
1. While the paper introduces CUR and IG as empirical metrics, it does not offer a strong theoretical framework explaining why redundancy arises or how encoder representations overlap in feature space. 2. The study primarily measures the effect of removing one encoder at a time (via single-encoder masking). This ignores higher-order interactions — for example, two encoders might each seem redundant individually but provide complementary information together. 3. Can the author provide the analysi
1. The paper is well-written and the metrics are well-defined. Easy to follow. 2. This paper challenges the prevailing view on vision encoder selection for open MLLMs, and provides a fresher perspective on inference selection strategy. 3. The evaluation setups are extensive and back up the claim that inference does not need all encoders.
1. This paper also feels related to visual token selection strategies; I think the paper should include relevant references in the related work section. 2. In the final paragraph of the introduction, the paper also mentions "in our setup, fine-tuning a dual-encoder variant is 1.69× faster than its five-encoder counterpart." However, I did not find experiments about fine-tuning in the paper. 3. To further show the merit of reducing vision encoders during inference, I believe it's better to also include
1. Problem framing and metrics. Treats redundancy in multi-encoder MLLMs as a first-class research object and introduces two reusable, precisely defined measures, CUR and IG, that turn a vague intuition (“more encoders help”) into testable quantities.
1. Lack of Constructive Improvements. While the proposed CUR and IG metrics help quantify encoder redundancy, the contribution appears incremental. Prior works such as Eagle have already observed and empirically analyzed similar redundancy phenomena (Fig. 4 in their paper). This paper mainly reformulates those observations into statistical metrics based on existing accuracy measures, without offering causal insights into why redundancy arises, e.g., whether it stems from overlapping feature spac
