TL;DR
This paper introduces a lightweight method to enhance multimodal large language models' visual reasoning by incorporating self-supervised, visually grounded tasks into instruction tuning, improving performance without architectural changes.
Contribution
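The paper reformulates classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as natural language instructions, so that visual information is better utilized during instruction tuning.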
Findings
Injecting a small share (3-10%) of visually grounded instructions into instruction tuning improves performance on vision-centric tasks.
The approach requires no additional annotations or architectural modifications.
Performance gains are consistent across multiple models and benchmarks.
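To make the 3-10% figure concrete, here is a minimal sketch of mixing visually grounded examples into an existing instruction-tuning set. It is not the authors' pipeline. The function name `mix_in_grounded`, its parameters, and the reading of the percentage as a share of the final mixture are all assumptions made for illustration.

```python
import random


def mix_in_grounded(base, grounded, fraction, seed=0):
    """Return a shuffled mixture in which about `fraction` of items are grounded.

    Hypothetical helper: `fraction` is interpreted as the share of the FINAL
    mixture, which is an assumption and not something the summary specifies.
    """
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_grounded / (len(base) + n_grounded) = fraction for n_grounded.
    n_grounded = round(fraction * len(base) / (1.0 - fraction))
    if n_grounded > len(grounded):
        raise ValueError("not enough grounded examples for this fraction")
    picked = rng.sample(grounded, n_grounded)  # sample without replacement
    mixed = list(base) + picked
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    base = [("base", i) for i in range(950)]
    grounded = [("grounded", i) for i in range(200)]
    mixed = mix_in_grounded(base, grounded, fraction=0.05)
    share = sum(1 for kind, _ in mixed if kind == "grounded") / len(mixed)
    print(len(mixed), share)
```

The base set is kept whole and grounded examples are added on top, so the existing instruction data is not subsampled. Whether the original work adds to or replaces part of the base data is not stated in this summary.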
Abstract
Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone. We propose a simple and lightweight approach that augments visual instruction tuning with a small number of visually grounded self-supervised tasks expressed as natural language instructions. By reformulating classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets, we introduce supervision that cannot be solved without relying on visual evidence. Our approach requires no human…
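The rotation-prediction reformulation described in the abstract can be sketched as follows. This is an illustrative example under stated assumptions, not the paper's implementation. The image is a plain nested list standing in for pixel data, and the instruction wording, the dictionary keys, and the names `rotate_clockwise` and `make_rotation_example` are hypothetical.

```python
import random

ANGLES = (0, 90, 180, 270)


def rotate_clockwise(image, angle):
    """Rotate a 2D grid (list of rows) clockwise by a multiple of 90 degrees."""
    if angle not in ANGLES:
        raise ValueError("angle must be one of 0, 90, 180, 270")
    for _ in range(angle // 90):
        # Reversing the row order and transposing gives one clockwise quarter turn.
        image = [list(row) for row in zip(*image[::-1])]
    return [list(row) for row in image]  # always return a fresh copy


def make_rotation_example(image, rng):
    """Build one image-instruction-response triplet for rotation prediction.

    The label is the angle we applied ourselves, so no human annotation is needed.
    The instruction text below is a made-up template, not the paper's prompt.
    """
    angle = rng.choice(ANGLES)
    return {
        "image": rotate_clockwise(image, angle),
        "instruction": (
            "This image was rotated clockwise by 0, 90, 180, or 270 degrees. "
            "Which angle was used?"
        ),
        "response": f"{angle} degrees",
    }


if __name__ == "__main__":
    toy_image = [[1, 2], [3, 4]]
    example = make_rotation_example(toy_image, random.Random(0))
    print(example["image"], example["response"])
```

The answer depends only on how the pixels are arranged, so a model cannot recover it from language priors alone. One caveat of this sketch: an image with rotational symmetry would make the label ambiguous, so a real pipeline would likely need to filter such cases.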
