Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models

Enguang Wang; Qiang Wang; Yuanchen Wu; Ke Yan; Xinbin Yuan; Shouhong Ding; Xialei Liu; Ming-Ming Cheng

arXiv:2603.20808·cs.CV·March 24, 2026

Open Access

TL;DR

This paper identifies visual representation degradation in multimodal large language models caused by training objectives and proposes a predictive regularization method to preserve visual fidelity, improving overall vision-language performance.

Contribution

The paper introduces Predictive Regularization (PRe), which trains intermediate features in an MLLM to predict the initial visual features, countering internal visual degradation and improving multimodal understanding.

Findings

01 Mitigating visual degradation improves vision-language task performance.

02 Degradation appears in the middle layers of the LLM and is attributed to the text-generation training objective.

03 PRe effectively preserves visual features in internal representations.

Abstract

While Multimodal Large Language Models (MLLMs) excel at vision-language tasks, the cost of their language-driven training to their internal, foundational visual competence remains unclear. In this paper, we conduct a detailed diagnostic analysis to unveil a pervasive issue: visual representation degradation in MLLMs. Specifically, we find that, compared to the initial visual features, the visual representations in the middle layers of the LLM degrade in both global function and patch structure. We attribute this phenomenon to a visual sacrifice driven by the singular text-generation objective, where the model compromises its visual fidelity to optimize for answer generation. We argue that a robust MLLM requires both strong cross-modal reasoning and core visual competence, and propose Predictive Regularization (PRe) to force degraded intermediate features to predict initial visual…
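
The abstract is cut off before the PRe objective is specified, so the following is only an illustrative sketch of the general idea it describes: intermediate visual features are passed through a predictor and penalized for failing to match the initial visual features. The cosine-distance loss, the linear predictor `W`, the weight `lam`, and all function names are assumptions made for this example and are not taken from the paper. Plain Python is used to keep the sketch dependency-free.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon avoids division by zero


def linear(W, x):
    """Apply a hypothetical linear predictor head: returns W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def predictive_reg_loss(intermediate, initial, W):
    """Mean cosine distance between predicted and initial patch features.

    intermediate: per-patch hidden states from a middle LLM layer (assumed input)
    initial:      per-patch visual features entering the LLM, used as fixed targets
    W:            weights of the hypothetical predictor head
    """
    losses = [1.0 - cosine(linear(W, h), t) for h, t in zip(intermediate, initial)]
    return sum(losses) / len(losses)


def total_loss(text_loss, reg_loss, lam=0.1):
    """Text-generation loss plus the weighted regularizer (lam is an assumed value)."""
    return text_loss + lam * reg_loss


# Two patches with 2-dimensional features and an identity predictor.
identity = [[1.0, 0.0], [0.0, 1.0]]
initial_feats = [[1.0, 0.0], [0.0, 1.0]]

preserved = predictive_reg_loss(initial_feats, initial_feats, identity)
degraded = predictive_reg_loss([[0.0, 1.0], [1.0, 0.0]], initial_feats, identity)
print(preserved, degraded)
```

In this toy case the regularizer is near zero when the intermediate features still match the initial ones and near one when each patch has drifted to an orthogonal direction. In a real training setup the targets would typically be detached from the gradient so that only the intermediate features and the predictor are updated; that choice is likewise an assumption here.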

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Language, Metaphor, and Cognition