Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models

Davide Caffagni; Sara Sarto; Marcella Cornia; Lorenzo Baraldi; Pier Luigi Dovesi; Shaghayegh Roohi; Mark Granroth-Wilding; Rita Cucchiara

arXiv:2512.15885 · cs.CV · December 19, 2025

TL;DR

JARVIS introduces a self-supervised visual learning framework for multimodal large language models, improving their visual reasoning and understanding by leveraging vision foundation models and structural regularities without relying solely on language supervision.

Contribution

The paper presents JARVIS, a self-supervised visual enhancement method that integrates the I-JEPA learning paradigm into the vision-language alignment stage of MLLM training, improving performance on vision-centric tasks without degrading multimodal reasoning.

Findings

01. Improved performance on vision-centric benchmarks across various MLLMs.

02. Enhanced visual understanding without degrading multimodal reasoning.

03. Effective use of frozen vision models as context and target encoders.

Abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLMs training. Our approach leverages frozen vision foundation models as context and…
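The abstract is truncated here, so the exact JARVIS objective is not stated on this page. As a rough illustration of the general I-JEPA idea it builds on (predicting the embeddings of held-out image regions from the visible ones, with the loss measured in embedding space rather than in pixels or text), a minimal Python sketch might look like the following. All names (`latent_prediction_loss`, `mean_predictor`) and the toy data are hypothetical, the patch embeddings stand in for the output of a frozen vision encoder, and the predictor is a trivial placeholder, not the paper's architecture.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def latent_prediction_loss(
    patch_embeddings: Sequence[Vector],
    context_idx: Sequence[int],
    target_idx: Sequence[int],
    predictor: Callable[[List[Vector], Sequence[int]], List[Vector]],
) -> float:
    """Average squared L2 error between predicted and actual target embeddings.

    Assumption: patch_embeddings come from a frozen encoder, so only the
    predictor would receive gradients in a real training setup.
    """
    context = [patch_embeddings[i] for i in context_idx]
    predictions = predictor(context, target_idx)
    targets = [patch_embeddings[i] for i in target_idx]
    total = sum(squared_l2(p, t) for p, t in zip(predictions, targets))
    return total / len(targets)


def mean_predictor(context: List[Vector], target_idx: Sequence[int]) -> List[Vector]:
    """Hypothetical stand-in predictor: guess the mean context embedding
    for every target position. A real predictor would be a learned network."""
    dim = len(context[0])
    mean = [sum(v[d] for v in context) / len(context) for d in range(dim)]
    return [list(mean) for _ in target_idx]


# Toy example: four 2-D "patch embeddings"; patches 0-1 are visible context,
# patches 2-3 are the held-out targets to be predicted.
patches = [[0.0, 0.0], [2.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
loss = latent_prediction_loss(
    patches, context_idx=[0, 1], target_idx=[2, 3], predictor=mean_predictor
)
print(loss)
```

The point of the sketch is only that the supervisory signal comes from the image's own embeddings, not from a caption, which is the property the abstract contrasts with language-only supervision.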

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling