Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models
Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Pier Luigi Dovesi, Shaghayegh Roohi, Mark Granroth-Wilding, Rita Cucchiara

TL;DR
JARVIS introduces a self-supervised visual learning framework for multimodal large language models, improving their visual reasoning and understanding by leveraging vision foundation models and structural regularities without relying solely on language supervision.
Contribution
The paper presents JARVIS, a self-supervised visual enhancement method that integrates the I-JEPA learning paradigm into MLLM training, improving performance on vision-centric benchmarks without degrading multimodal reasoning.
Findings
Improved performance on vision-centric benchmarks across various MLLMs.
Enhanced visual understanding without degrading multimodal reasoning.
Effective use of frozen vision models as context and target encoders.
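The last finding follows the I-JEPA pattern: a context encoder embeds the visible part of an image, a target encoder embeds the hidden part, and a predictor learns to map one embedding to the other, so the loss lives in embedding space rather than in text. The sketch below is a toy illustration of that pattern only, not the JARVIS implementation. The fixed random linear maps standing in for frozen vision models, the linear predictor, the squared-error loss, and every function name are assumptions made for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def make_matrix(rows, cols, rng, scale=0.5):
    """Random matrix with entries drawn uniformly from [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def jepa_step(predictor, context_emb, target_emb, lr):
    """One gradient step on the predictor only; returns the loss before the update."""
    pred = matvec(predictor, context_emb)
    loss = mse(pred, target_emb)
    n = len(target_emb)
    for i in range(n):
        grad_out = 2.0 * (pred[i] - target_emb[i]) / n  # d(loss)/d(pred[i])
        for j in range(len(context_emb)):
            predictor[i][j] -= lr * grad_out * context_emb[j]
    return loss


def run_demo(steps=200, dim=4, seed=0):
    rng = random.Random(seed)
    # Hypothetical stand-ins for frozen vision models: fixed linear maps, never updated.
    context_encoder = make_matrix(dim, dim, rng)
    target_encoder = make_matrix(dim, dim, rng)
    # Hypothetical predictor: the only trainable component in this sketch.
    predictor = make_matrix(dim, dim, rng)
    visible_patch = [rng.uniform(-1, 1) for _ in range(dim)]  # context block
    masked_patch = [rng.uniform(-1, 1) for _ in range(dim)]   # target block
    context_emb = matvec(context_encoder, visible_patch)
    target_emb = matvec(target_encoder, masked_patch)
    return [jepa_step(predictor, context_emb, target_emb, lr=0.1) for _ in range(steps)]


if __name__ == "__main__":
    losses = run_demo()
    print(f"first loss: {losses[0]:.6f}, last loss: {losses[-1]:.6f}")
```

Because both encoders stay fixed, only the predictor's weights change during training, which is the property the finding highlights. A real system would use pretrained vision backbones and many image patches per step.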
Abstract
Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit to language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLM training. Our approach leverages frozen vision foundation models as context and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
