Boosting Visual Instruction Tuning with Self-Supervised Guidance

Sophia Sirko-Galouchenko; Monika Wysoczanska; Andrei Bursuc; Nicolas Thome; Spyros Gidaris

arXiv:2604.12966 · cs.CV · April 15, 2026

TL;DR

This paper introduces a lightweight method to enhance multimodal large language models' visual reasoning by incorporating self-supervised, visually grounded tasks into instruction tuning, improving performance without architectural changes.

Contribution

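The paper reformulates classical self-supervised pretext tasks (rotation prediction, color matching, cross-view correspondence) as natural language instructions, so that instruction tuning includes examples that cannot be answered without using the image.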

Findings

01 Injecting 3-10% of visually grounded instructions improves vision-centric task performance.

02 The approach requires no additional annotations or architectural modifications.

03 Performance gains are consistent across multiple models and benchmarks.
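
To make the 3-10% figure in the first finding concrete, here is a minimal sketch of mixing a small share of visually grounded samples into an existing instruction-tuning set. It is an illustration, not the authors' pipeline: the function name `mix_instructions`, the interpretation of the percentage as a share of the final mixture, and the uniform sampling are all assumptions.

```python
import random


def mix_instructions(base, grounded, fraction, seed=0):
    """Return base samples plus enough grounded samples to reach `fraction`.

    Assumption: `fraction` is the share of grounded samples in the FINAL
    mixture (e.g. 0.05 for 5%), not a ratio relative to the base set.
    """
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_g / (n_b + n_g) = fraction for n_g.
    n_grounded = round(fraction * len(base) / (1.0 - fraction))
    if n_grounded > len(grounded):
        raise ValueError("not enough grounded samples for this fraction")
    # Sample without replacement, then shuffle so grounded items are interleaved.
    mixed = list(base) + rng.sample(list(grounded), n_grounded)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    base = [{"source": "base", "id": i} for i in range(950)]
    grounded = [{"source": "grounded", "id": i} for i in range(200)]
    mixed = mix_instructions(base, grounded, fraction=0.05)
    share = sum(s["source"] == "grounded" for s in mixed) / len(mixed)
    print(len(mixed), share)
```

Because the grounded samples are generated automatically, the only tunable quantity in this sketch is `fraction`; the base data and the model are left untouched.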

Abstract

Multimodal large language models (MLLMs) perform well on many vision-language tasks but often struggle with vision-centric problems that require fine-grained visual reasoning. Recent evidence suggests that this limitation arises not from weak visual representations, but from under-utilization of visual information during instruction tuning, where many tasks can be partially solved using language priors alone. We propose a simple and lightweight approach that augments visual instruction tuning with a small number of visually grounded self-supervised tasks expressed as natural language instructions. By reformulating classical self-supervised pretext tasks, such as rotation prediction, color matching, and cross-view correspondence, as image-instruction-response triplets, we introduce supervision that cannot be solved without relying on visual evidence. Our approach requires no human…
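
As an illustration of how a pretext task can become an image-instruction-response triplet, the sketch below builds a rotation-prediction sample. It is a hedged example only: the instruction wording, the four-angle label set, the dictionary keys, and the use of a nested list as a stand-in for an image are assumptions made here, not details taken from the paper or its repository.

```python
import random

# Assumed label set: rotations by multiples of 90 degrees.
ANGLES = (0, 90, 180, 270)


def rotate90(image):
    """Rotate a row-major pixel grid 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def make_rotation_triplet(image, rng):
    """Build one (image, instruction, response) sample for rotation prediction.

    The label comes from the transformation itself, so no human
    annotation is needed.
    """
    angle = rng.choice(ANGLES)
    rotated = [list(row) for row in image]  # copy so the input is not mutated
    for _ in range(angle // 90):
        rotated = rotate90(rotated)
    instruction = (
        "This image was rotated clockwise by 0, 90, 180, or 270 degrees. "
        "By how many degrees was it rotated?"
    )
    return {
        "image": rotated,
        "instruction": instruction,
        "response": f"{angle} degrees",
    }


if __name__ == "__main__":
    toy_image = [[1, 2, 3], [4, 5, 6]]  # toy 2x3 "image"
    sample = make_rotation_triplet(toy_image, random.Random(0))
    print(sample["instruction"])
    print(sample["response"])
```

The question text is identical for every sample, so the answer can only be recovered from the pixels, which is the property the abstract asks of these tasks.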

Code & Models

Repositories

sirkosophia/V-GIFT (GitHub)